The logical basis for information theory is the newly developed logic of partitions that is
dual to the usual Boolean logic of subsets. The key concept is a "distinction" of a partition, an
ordered pair of elements in distinct blocks of the partition. The logical concept of entropy based
on partition logic is the normalized counting measure of the set of distinctions of a partition
on a finite set, just as the usual logical notion of probability based on the Boolean logic of
subsets is the normalized counting measure of the subsets (events). Thus logical entropy is a
measure on the set of ordered pairs, and all the compound notions of entropy (joint entropy,
conditional entropy, and mutual information) arise in the usual way from the measure (e.g.,
the inclusion-exclusion principle), just like the corresponding notions of probability. The usual
Shannon entropy of a partition is developed by replacing the normalized count of distinctions
(dits) by the average number of binary partitions (bits) necessary to make all the distinctions
of the partition.

1 Introduction

Information is about making distinctions or differences. In James Gleick's book, The Information: A 
History, A Theory, A Flood, he noted the focus on differences in the seventeenth century polymath, 
John Wilkins, who was a founder of the Royal Society. In 1641, the year before Newton was born, 
Wilkins published one of the earliest books on cryptography, Mercury or the Secret and Swift 
Messenger, which not only pointed out the fundamental role of differences but noted that any 
(finite) set of different things could be encoded by words in a binary code. 

For in the general we must note, That whatever is capable of a competent Difference, 
perceptible to any Sense, may be a sufficient Means whereby to express the Cogitations. 
It is more convenient, indeed, that these Differences should be of as great Variety as the 
Letters of the Alphabet; but it is sufficient if they be but twofold, because Two alone 
may, with somewhat more Labour and Time, be well enough contrived to express all 
the rest. [30, Chap. XVII, p. 69] 

Wilkins explains that a five-letter binary code would be sufficient to code the letters of the alphabet
since $2^5 = 32$.

Thus any two Letters or Numbers, suppose A.B. being transposed through five Places,
will yield Thirty Two Differences, and so consequently will superabundantly serve for
the Four and twenty Letters... . [30, Chap. XVII, p. 69]

As Gleick noted:

Any difference meant a binary choice. Any binary choice began the expressing of cogitations.
Here, in this arcane and anonymous treatise of 1641, the essential idea of information
theory poked to the surface of human thought, saw its shadow, and disappeared
again for [three] hundred years. [12, p. 161]

In this paper, we will start afresh by deriving an information-as-distinctions notion of logical 
entropy [7] from the new logic of partitions [8] that is mathematically dual to the usual Boolean 
logic of subsets. Then the usual Shannon entropy [27] will be essentially derived from the concepts 
behind logical entropy as another way to measure information-as-distinctions. This treatment of the 
various notions of Shannon entropy (e.g., mutual, conditional, and joint entropy) will also explain 
why their interrelations can be represented using a Venn diagram picture [5]. 

2 Logical Entropy 
2.1 Partition logic 

The logic normally called "propositional logic" is a special case of the logic of subsets originally
developed by George Boole [4]. In the Boolean logic of subsets of a fixed non-empty universe set
$U$, the variables in formulas refer to subsets $S \subseteq U$ and the logical operations such as the join
$S \vee T$, meet $S \wedge T$, and implication $S \Rightarrow T$ are interpreted as the subset operations of union
$S \cup T$, intersection $S \cap T$, and the conditional $S \Rightarrow T = S^c \cup T$. Then "propositional" logic is
the special case where $U = 1$ is the one-element set whose subsets $\emptyset$ and $1$ are interpreted as the
truth values $0$ and $1$ (or false and true) for propositions.

In subset logic, a valid formula or tautology is a formula such as $[S \wedge (S \Rightarrow T)] \Rightarrow T$ where for
any non-empty $U$, no matter what subsets of $U$ are substituted for the variables, the whole formula
evaluates to $U$ by the subset operations. It is a theorem that if a formula is valid just for the special
case of $U = 1$ (i.e., as in a truth table tautology), then it is valid for any $U$. But in today's textbook
treatments of so-called "propositional" logic, the truth-table version of a tautology is usually given
as a definition, not as a theorem in subset logic.
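
As an illustration of validity in subset logic, the following is a minimal sketch that brute-forces the modus ponens formula over every pair of subsets of a small universe and checks that it always evaluates to $U$. The function names (`powerset`, `implies`, `modus_ponens_is_valid`) are illustrative choices, not notation from the text.

```python
from itertools import chain, combinations

def powerset(universe):
    """All subsets of `universe`, each as a frozenset."""
    items = list(universe)
    subsets = chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))
    return [frozenset(s) for s in subsets]

def implies(s, t, universe):
    """Subset conditional S => T, i.e. the complement of S united with T."""
    return (universe - s) | t

def modus_ponens_is_valid(universe):
    """True if [S and (S => T)] => T equals U for every pair of subsets S, T of U."""
    universe = frozenset(universe)
    return all(
        implies(s & implies(s, t, universe), t, universe) == universe
        for s in powerset(universe)
        for t in powerset(universe)
    )

# The one-element universe corresponds to the truth-table check.
print(modus_ponens_is_valid({1}))
# A larger universe illustrates that validity carries over to any U.
print(modus_ponens_is_valid({"a", "b", "c"}))
```

The check is exhaustive, so it is only practical for small universes; it illustrates the theorem rather than proving it.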

What is lost by restricting attention to the special case of propositional logic rather than the 
general case of subset logic? At least two things are lost, and both are relevant for our development. 

• Firstly, if it is developed as the logic of subsets, then it is natural, as Boole did, to attach a
quantitative measure to each subset $S$ of a finite universe $U$, namely the normalized counting
measure $\frac{|S|}{|U|}$, which can be interpreted as the logical probability $\Pr(S)$ (where the elements of
$U$ are assumed equiprobable) of randomly drawing an element from $S$.

• Secondly, the notion of a subset (unlike the notion of a proposition) has a mathematical dual 
in the notion of a quotient set, as is evidenced by the dual interplay between subobjects 
(subgroups, subrings,...) and quotient objects throughout abstract algebra. 

This duality is the "turn-around-the-arrows" category-theoretic duality, e.g., between monomorphisms
and epimorphisms, applied to sets [20]. The notion of a quotient set of $U$ is equivalent to
the notion of an equivalence relation on $U$ or a partition $\pi = \{B\}$ of $U$. When Boole's logic is seen
as the logic of subsets (rather than propositions), then the notion arises of a dual logic of partitions
which has now been developed [8].

2.2 Logical Entropy

A partition $\pi = \{B\}$ on a finite set $U$ is a set of non-empty disjoint subsets $B$ ("blocks" of the
partition) of $U$ whose union is $U$. The idea of information-as-distinctions is made precise by defining
a distinction or dit of a partition $\pi = \{B\}$ of $U$ as an ordered pair $(u, u')$ of elements $u, u' \in U$
that are in different blocks of the partition. The notion of "a distinction of a partition" plays the
analogous role in partition logic as the notion of "an element of a subset" in subset logic. The set of
distinctions of a partition $\pi$ is its dit set $\operatorname{dit}(\pi)$. The subsets of $U$ are partially ordered by inclusion
with the universe set $U$ as the top of the order and the empty set $\emptyset$ as the bottom of the order. A
partition $\pi = \{B\}$ refines a partition $\sigma = \{C\}$, written $\sigma \preceq \pi$, if each block $B \in \pi$ is contained in
some block $C \in \sigma$. The partitions of $U$ are partially ordered by refinement which is equivalent to
the inclusion ordering of dit sets. The discrete partition $\mathbf{1} = \{\{u\}\}_{u \in U}$, where the blocks are all the
singletons, is the top of the order, and the indiscrete partition $\mathbf{0} = \{U\}$ (with just one block $U$) is
the bottom. Only the self-pairs $(u, u) \in \Delta \subseteq U \times U$ of the diagonal $\Delta$ can never be a distinction.
All the possible distinctions $U \times U - \Delta$ are the dits of $\mathbf{1}$ and no dits are distinctions of $\mathbf{0}$, just as
all the elements are in $U$ and none in $\emptyset$.
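
The definitions of dit sets and refinement can be made concrete with a short sketch. It represents a partition as a list of Python sets (an assumption of this example, not a convention of the text) and checks on a four-element universe that refinement agrees with inclusion of dit sets.

```python
from itertools import product

def dit_set(partition):
    """Ordered pairs (u, v) whose elements lie in different blocks of `partition`."""
    block_of = {u: i for i, block in enumerate(partition) for u in block}
    return {(u, v) for u, v in product(block_of, repeat=2) if block_of[u] != block_of[v]}

def refines(pi, sigma):
    """True if every block of `pi` is contained in some block of `sigma`."""
    return all(any(set(b) <= set(c) for c in sigma) for b in pi)

U = [0, 1, 2, 3]
sigma = [{0, 1}, {2, 3}]
pi = [{0}, {1}, {2, 3}]
discrete = [{u} for u in U]      # top of the refinement order
indiscrete = [set(U)]            # bottom of the refinement order

# pi refines sigma, and correspondingly dit(sigma) is included in dit(pi).
print(refines(pi, sigma), dit_set(sigma) <= dit_set(pi))
# The discrete partition has all |U|^2 - |U| dits; the indiscrete one has none.
print(len(dit_set(discrete)), len(dit_set(indiscrete)))
```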

In this manner, we can construct a table of analogies between subset logic and partition logic. 





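|                                    | Subset logic                                      | Partition logic                                                  |
|------------------------------------|---------------------------------------------------|------------------------------------------------------------------|
| 'Elements'                         | Elements $u$ of $S$                               | Dits $(u, u')$ of $\pi$                                          |
| Order                              | Inclusion $S \subseteq T$                         | Refinement: $\operatorname{dit}(\sigma) \subseteq \operatorname{dit}(\pi)$ |
| Top of order                       | $U$, all elements                                 | $\operatorname{dit}(\mathbf{1}) = U^2 - \Delta$, all dits        |
| Bottom of order                    | $\emptyset$, no elements                          | $\operatorname{dit}(\mathbf{0}) = \emptyset$, no dits            |
| Variables in formulas              | Subsets $S$ of $U$                                | Partitions $\pi$ on $U$                                          |
| Operations                         | Subset ops.                                       | Partition ops. [8]                                               |
| Formula $\Phi(x, y, \ldots)$ holds | $u$ element of $\Phi(S, T, \ldots)$               | $(u, u')$ dit of $\Phi(\pi, \sigma, \ldots)$                     |
| Valid formula                      | $\Phi(S, T, \ldots) = U$, $\forall S, T, \ldots$  | $\Phi(\pi, \sigma, \ldots) = \mathbf{1}$, $\forall \pi, \sigma, \ldots$ |

Table of analogies between subset and partition logics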



A dit set $\operatorname{dit}(\pi)$ of a partition on $U$ is a subset of $U \times U$ of a particular kind, namely the
complement of an equivalence relation. An equivalence relation is reflexive, symmetric, and transitive.
Hence the complement is a subset $P \subseteq U \times U$ that is:

1. irreflexive (or anti-reflexive), $P \cap \Delta = \emptyset$;

2. symmetric, $(u, u') \in P$ implies $(u', u) \in P$; and

3. anti-transitive (or co-transitive), if $(u, u'') \in P$ then for any $u' \in U$, $(u, u') \in P$ or $(u', u'') \in P$,

and such binary relations will be called partition relations (also called apartness relations).

Given any subset $S \subseteq U \times U$, the reflexive-symmetric-transitive (rst) closure $\overline{S^c}$ of the
complement $S^c$ is the smallest equivalence relation containing $S^c$, so its complement is the largest partition
relation contained in $S$, which is called the interior $\operatorname{int}(S)$ of $S$. This usage is consistent with calling
the subsets that equal their rst-closures closed subsets of $U \times U$ (so closed subsets = equivalence
relations) so the complements are the open subsets (= partition relations). However, it should be
noted that the rst-closure is not a topological closure since the closure of a union is not necessarily
the union of the closures, so the "open" subsets do not form a topology on $U \times U$.

The interior operation $\operatorname{int} : \wp(U \times U) \to \wp(U \times U)$ provides a universal way to define
operations on partitions from the corresponding subset operations:

apply the subset operation to the dit sets and then, if necessary, take the interior to
obtain the dit set of the partition operation.

Given partitions $\pi = \{B\}$ and $\sigma = \{C\}$ on $U$, their join $\pi \vee \sigma$ is the partition whose dit set
$\operatorname{dit}(\pi \vee \sigma)$ is the interior of $\operatorname{dit}(\pi) \cup \operatorname{dit}(\sigma)$ (since the union $\cup$ is the subset join operation). But
the union of partition relations (open subsets) is a partition relation (open subset) so that:

$$\operatorname{dit}(\pi \vee \sigma) = \operatorname{dit}(\pi) \cup \operatorname{dit}(\sigma).$$

This gives the same join $\pi \vee \sigma$ as the usual definition which is the partition whose blocks are the
non-empty intersections $B \cap C$ for $B \in \pi$ and $C \in \sigma$. To define the meet $\pi \wedge \sigma$ of the two partitions,
we apply the subset meet operation of intersection to the dit sets and then take the interior (which
is necessary in this case):

$$\operatorname{dit}(\pi \wedge \sigma) = \operatorname{int}[\operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)].$$

This gives the same result as the usual definition of the partition meet in the literature.¹ Perhaps
surprisingly, the other logical operations such as the implication do not seem to be defined for
partitions in the literature. Since the subset operation of implication is $S \Rightarrow T = S^c \cup T$, we define
the partition implication $\sigma \Rightarrow \pi$ as the partition whose dit set is:

$$\operatorname{dit}(\sigma \Rightarrow \pi) = \operatorname{int}[\operatorname{dit}(\sigma)^c \cup \operatorname{dit}(\pi)].^2$$

The refinement partial order $\sigma \preceq \pi$ is just inclusion of dit sets, i.e., $\sigma \preceq \pi$ iff $\operatorname{dit}(\sigma) \subseteq \operatorname{dit}(\pi)$. If
we denote the lattice of partitions (using the refinement ordering) as $\Pi(U)$, then the mapping:

$$\operatorname{dit} : \Pi(U) \to \wp(U \times U)$$
Dit set representation of partition lattice

represents the lattice of partitions as the lattice $\mathcal{O}(U \times U)$ of open subsets (under inclusion) of
$\wp(U \times U)$.
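
The recipe "apply the subset operation to the dit sets, then take the interior" can be sketched in code. In this minimal sketch the interior is computed by finding the connected components of the complement relation, which is one way to realize the rst-closure (an implementation choice of this example); partitions are lists of sets, and all function names are illustrative.

```python
from itertools import product

def dit_set(partition):
    """Ordered pairs (u, v) whose elements lie in different blocks of `partition`."""
    block_of = {u: i for i, block in enumerate(partition) for u in block}
    return {(u, v) for u, v in product(block_of, repeat=2) if block_of[u] != block_of[v]}

def interior_partition(pairs, universe):
    """Partition whose dit set is int(pairs).

    Its blocks are the classes of the rst-closure of the complement of `pairs`,
    found here as connected components of the graph of complement pairs.
    """
    universe = list(universe)
    parent = {u: u for u in universe}

    def find(u):
        while parent[u] != u:
            u = parent[u]
        return u

    for u, v in product(universe, repeat=2):
        if (u, v) not in pairs:          # (u, v) is in the complement: merge classes
            parent[find(u)] = find(v)

    blocks = {}
    for u in universe:
        blocks.setdefault(find(u), set()).add(u)
    return list(blocks.values())

def join(pi, sigma, universe):
    return interior_partition(dit_set(pi) | dit_set(sigma), universe)

def meet(pi, sigma, universe):
    return interior_partition(dit_set(pi) & dit_set(sigma), universe)

def implication(sigma, pi, universe):
    """sigma => pi, from int[dit(sigma)^c U dit(pi)]."""
    all_pairs = set(product(universe, repeat=2))
    return interior_partition((all_pairs - dit_set(sigma)) | dit_set(pi), universe)

def as_blocks(partition):
    """Order-independent form of a partition, for comparisons."""
    return {frozenset(b) for b in partition}

U = [0, 1, 2, 3]
pi = [{0, 1}, {2, 3}]
sigma = [{0, 2}, {1, 3}]
print(as_blocks(join(pi, sigma, U)))         # the four singletons
print(as_blocks(meet(pi, sigma, U)))         # the single block U
print(as_blocks(implication(sigma, pi, U)))  # same blocks as pi here
```

For the join the interior step changes nothing, since the union of two dit sets is already a partition relation; for the meet and the implication it is what turns an arbitrary set of pairs back into a dit set.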

For any finite set $X$, a (finite) measure $\mu$ is a function $\mu : \wp(X) \to \mathbb{R}$ such that:

1. $\mu(\emptyset) = 0$,

2. for any $E \subseteq X$, $\mu(E) \geq 0$, and

3. for any disjoint subsets $E_1$ and $E_2$, $\mu(E_1 \cup E_2) = \mu(E_1) + \mu(E_2)$.

1 But note that many authors think in terms of equivalence relations instead of partition relations and thus reverse
the definitions of the join and meet. Hence their "lattice of partitions" is really the lattice of equivalence relations,
the opposite of the partition lattice $\Pi(U)$ defined here with refinement as the ordering relation.

2 The equivalent but more perspicuous definition of $\sigma \Rightarrow \pi$ is the partition that is like $\pi$ except that whenever a
block $B \in \pi$ is contained in a block $C \in \sigma$, then $B$ is 'discretized' in the sense of being replaced by all the singletons
$\{u\}$ for $u \in B$. Then it is immediate that the refinement $\sigma \preceq \pi$ holds iff $\sigma \Rightarrow \pi = \mathbf{1}$, as we would expect from the
corresponding relation, $S \subseteq T$ iff $S \Rightarrow T = S^c \cup T = U$, in subset logic.

Any finite set $X$ has the counting measure $|\cdot| : \wp(X) \to \mathbb{R}$ and normalized counting measure
$\frac{|\cdot|}{|X|} : \wp(X) \to \mathbb{R}$ defined on the subsets of $X$. Hence for finite $U$, we have the counting measure
$|\cdot|$ and the normalized counting measure $\frac{|\cdot|}{|U \times U|}$ defined on $\wp(U \times U)$. Boole used the normalized
counting measure $\frac{|\cdot|}{|U|}$ defined on the power-set Boolean algebra $\wp(U)$ to define the logical probability
$\Pr(S) = \frac{|S|}{|U|}$ of an event $S \subseteq U$. In view of the analogy between elements in subset logic and dits
in partition logic, the construction analogous to the logical probability is the normalized counting
measure applied to dit sets. That is the definition of the:

$$h(\pi) = \frac{|\operatorname{dit}(\pi)|}{|U \times U|}$$
Logical entropy of a partition $\pi$.

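Thus the logical entropy function $h()$ is the dit set representation composed with the normalized
counting measure:

$$h : \Pi(U) \xrightarrow{\ \operatorname{dit}\ } \wp(U \times U) \xrightarrow{\ |\cdot| / |U \times U|\ } \mathbb{R}$$
Logical entropy function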
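One immediate consequence is the inclusion-exclusion principle:

$$\frac{|\operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)|}{|U \times U|} = \frac{|\operatorname{dit}(\pi)|}{|U \times U|} + \frac{|\operatorname{dit}(\sigma)|}{|U \times U|} - \frac{|\operatorname{dit}(\pi) \cup \operatorname{dit}(\sigma)|}{|U \times U|} = h(\pi) + h(\sigma) - h(\pi \vee \sigma)$$

which provides the motivation for our definition below of $\frac{|\operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)|}{|U \times U|}$ as the "logical mutual
information" of the partitions $\pi$ and $\sigma$.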
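
The inclusion-exclusion identity can be checked numerically. The sketch below uses exact fractions and a hypothetical six-element universe with two partitions chosen only for illustration; the join is computed by the usual non-empty-intersection definition.

```python
from fractions import Fraction
from itertools import product

def dit_set(partition):
    """Ordered pairs (u, v) whose elements lie in different blocks of `partition`."""
    block_of = {u: i for i, block in enumerate(partition) for u in block}
    return {(u, v) for u, v in product(block_of, repeat=2) if block_of[u] != block_of[v]}

def logical_entropy(partition):
    """h(pi) = |dit(pi)| / |U x U| for equiprobable elements."""
    n = sum(len(block) for block in partition)
    return Fraction(len(dit_set(partition)), n * n)

def join(pi, sigma):
    """Blocks of the join are the non-empty intersections of blocks."""
    return [b & c for b in pi for c in sigma if b & c]

pi = [{0, 1, 2}, {3, 4, 5}]
sigma = [{0, 1}, {2, 3}, {4, 5}]

# Normalized count of the pairs distinguished by both partitions.
mutual = Fraction(len(dit_set(pi) & dit_set(sigma)), 36)

print(logical_entropy(pi), logical_entropy(sigma), logical_entropy(join(pi, sigma)))
print(mutual == logical_entropy(pi) + logical_entropy(sigma) - logical_entropy(join(pi, sigma)))
```

Using `Fraction` keeps the identity exact rather than subject to floating-point rounding.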

In a random (i.e., equiprobable) drawing of an element from $U$, the event $S$ occurs with the
probability $\Pr(S)$. If we take two independent (i.e., with replacement) random drawings from $U$,
i.e., pick a random ordered pair from $U \times U$, then $h(\pi)$ is the probability that the pair is a distinction
of $\pi$, i.e., that $\pi$ distinguishes. These analogies are summarized in the following table which uses
the language of probability theory (e.g., set of outcomes, events, the occurrence of an event):





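|                        | Subset logic                                         | Partition logic                                                                       |
|------------------------|------------------------------------------------------|---------------------------------------------------------------------------------------|
| 'Outcomes'             | Elements $u$ of $S$                                  | Ordered pairs $(u, u') \in U \times U$                                                |
| 'Events'               | Subsets $S$ of $U$                                   | Partitions $\pi$ of $U$                                                               |
| 'Event occurs'         | $u \in S$                                            | $(u, u') \in \operatorname{dit}(\pi)$                                                 |
| Norm. counting measure | $\Pr(S) = \frac{\lvert S \rvert}{\lvert U \rvert}$   | $h(\pi) = \frac{\lvert \operatorname{dit}(\pi) \rvert}{\lvert U \times U \rvert}$     |
| Interpretation         | Prob. event $S$ occurs                               | Prob. partition $\pi$ distinguishes                                                   |

Table of quantitative analogies between subset and partition logics.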



Thus logical entropy $h(\pi)$ is the simple quantitative measure of the distinctions of a partition
$\pi$ just as the logical probability $\Pr(S)$ is the quantitative measure of the elements in a subset $S$.
In short, information theory is to partition logic as probability theory is to ordinary subset logic.

To generalize logical entropy from partitions to finite probability distributions, note that:

$$\operatorname{dit}(\pi) = \bigcup \{B \times B' : B, B' \in \pi, B \neq B'\} = U \times U - \bigcup \{B \times B : B \in \pi\}.$$

Using $p_B = \frac{|B|}{|U|}$, we have:

$$h(\pi) = \frac{|\operatorname{dit}(\pi)|}{|U \times U|} = \frac{|U|^2 - \sum_{B \in \pi} |B|^2}{|U|^2} = 1 - \sum_{B \in \pi} \left(\frac{|B|}{|U|}\right)^2 = 1 - \sum_{B \in \pi} p_B^2.$$

An ordered pair $(u, u') \in B \times B$ for some $B \in \pi$ is an indistinction or indit of $\pi$ where $\operatorname{indit}(\pi) =
U \times U - \operatorname{dit}(\pi)$. Hence in a random drawing of a pair from $U \times U$, $\sum_{B \in \pi} p_B^2$ is the probability of
drawing an indistinction, while $h(\pi) = 1 - \sum_{B \in \pi} p_B^2$ is the probability of drawing a distinction.

Entropies will be defined both for partitions on finite sets and for finite probability distributions
(i.e., finite random variables). Given a random variable $u$ with the probability distribution $p =
(p_1, \ldots, p_n)$ over the $n$ distinct values $U = \{u_1, \ldots, u_n\}$, a distinction of the discrete partition on $U$
is just a pair $(u_i, u_j)$ with $i \neq j$ and with the probability $p_i p_j$. Applying the previous notion of the
logical entropy of a partition to this case with $p_B = p_i$ (where $B = \{u_i\}$), we have the:

$$h(p) = 1 - \sum_i p_i^2 = \sum_i p_i (1 - p_i)$$
Logical entropy of a finite probability distribution $p$.³

Since $1 = \left(\sum_{i=1}^n p_i\right)^2 = \sum_i p_i^2 + \sum_{i \neq j} p_i p_j$, we again have the logical entropy $h(p)$ as the
probability $\sum_{i \neq j} p_i p_j$ of drawing a distinction in two independent samplings of the probability
distribution $p$. This is also clear from defining the product measure on the subsets $S \subseteq U \times U$:

$$\mu(S) = \sum \{p_i p_j : (u_i, u_j) \in S\}$$
Product measure on $U \times U$

Then the logical entropy $h(p) = \mu(\operatorname{dit}(\mathbf{1}_U))$ is just the product measure of the dit set of the discrete
partition on $U$. There is also the obvious generalization to consider any partition $\pi$ on $U$ and
then define for each block $B \in \pi$, $p_B = \sum_{u_i \in B} p_i$. Then the logical entropy $h(\pi) = \mu(\operatorname{dit}(\pi))$ is
the product measure of the dit set of $\pi$ (so it is still interpreted as the probability of drawing a
distinction of $\pi$) and that is equivalent to $\sum_{B} p_B (1 - p_B)$.
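
The equivalence between the product measure of a dit set and the closed form $\sum_B p_B (1 - p_B)$ can be checked directly. This is a minimal sketch with a made-up four-point distribution; the dictionary representation of $p$ and the function names are assumptions of the example.

```python
from fractions import Fraction
from itertools import product

def logical_entropy_by_pairs(partition, prob):
    """Product measure of dit(partition): sum of p_i * p_j over pairs in different blocks."""
    block_of = {u: i for i, block in enumerate(partition) for u in block}
    return sum(
        prob[u] * prob[v]
        for u, v in product(prob, repeat=2)
        if block_of[u] != block_of[v]
    )

def logical_entropy_by_blocks(partition, prob):
    """Closed form: sum over blocks B of p_B * (1 - p_B)."""
    block_probs = [sum(prob[u] for u in block) for block in partition]
    return sum(p * (1 - p) for p in block_probs)

# Hypothetical distribution on four outcomes.
prob = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 8), "d": Fraction(1, 8)}
discrete = [{u} for u in prob]
pi = [{"a"}, {"b", "c", "d"}]

print(logical_entropy_by_pairs(discrete, prob), logical_entropy_by_blocks(discrete, prob))
print(logical_entropy_by_pairs(pi, prob), logical_entropy_by_blocks(pi, prob))
```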

For the uniform distribution $p_i = \frac{1}{n}$, the logical entropy has its maximum value of $1 - \frac{1}{n}$.
Regardless of the first draw (even for a different probability distribution over the same $n$ outcomes),
the probability that the second draw is different is $1 - \frac{1}{n}$. The logical entropy has its minimum
value of $0$ for $p = (1, 0, \ldots, 0)$ so that:

$$0 \leq h(p) \leq 1 - \frac{1}{n}.$$

An important special case is a set $U$ of $|U| = N$ equiprobable elements and a partition $\pi$ on
$U$ with $n$ equal-sized blocks of $N/n$ elements each. Then the number of distinctions of elements
is $N^2 - n\left(\frac{N}{n}\right)^2 = N^2 - \frac{N^2}{n}$ which normalizes to the logical entropy of $h(\pi) = 1 - \frac{1}{n}$ and which
is independent of $N$. Thus it holds when $N = n$ and we take the elements to be the equal blocks
themselves. Thus for an equal-blocked partition on a set of equiprobable elements, the normalized
number of distinctions of elements is the same as the normalized number of distinctions of blocks,
and that quantity is the:

$$h(p_0) = 1 - p_0 = 1 - \frac{1}{n}$$
Logical entropy of an equiprobable set of $n$ elements.
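
As a worked instance of this independence from $N$, take the illustrative values $n = 3$ blocks with $N = 6$ elements, and then $N = n = 3$ where the elements are the blocks themselves:

```latex
% N = 6 elements in n = 3 blocks of size 2:
\frac{N^2 - n\left(\frac{N}{n}\right)^2}{N^2} = \frac{36 - 3 \cdot 4}{36} = \frac{24}{36} = \frac{2}{3} = 1 - \frac{1}{3}
% N = n = 3, each block a single element:
\frac{N^2 - n\left(\frac{N}{n}\right)^2}{N^2} = \frac{9 - 3 \cdot 1}{9} = \frac{6}{9} = \frac{2}{3} = 1 - \frac{1}{3}
```

Both counts normalize to the same value, as the general formula predicts.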

3 This could be taken as the logical entropy $h(u)$ of the random variable $u$ but since the values of $u$ are irrelevant
(other than being distinct for $i \neq j$), we can take the logical entropy $h(p)$ as a function solely of the probability
distribution $p$ of the random variable.

2.3 A statistical treatment of logical entropy

It might be noted that no averaging is involved in the interpretation of $h(\pi)$. It is the number
of distinctions $|\operatorname{dit}(\pi)|$ normalized. The definition of the logical entropy $h(p) = \sum_{i=1}^n p_i h(p_i) =
\sum_{i=1}^n p_i (1 - p_i)$ of a probability distribution $p = (p_1, \ldots, p_n)$ is in the form of the average value of
the random variable which has the value $h(p_i) = 1 - p_i$ with the probability $p_i$.

Hence the formula can be arrived at by applying the law of large numbers in the form where
the finite random variable $X$ takes the value $x_i$ with probability $p_i$:

$$\lim_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} x_j = \sum_{i=1}^{n} p_i x_i.$$

At each step $j$ in repeated independent sampling $u_1 u_2 \ldots u_N$ of the probability distribution
$p = (p_1, \ldots, p_n)$, the probability that the $j$-th result was not $u_j$ is $1 - \Pr(u_j)$ so the average
probability of the result being different than it was at each place in that sequence is:

$$\frac{1}{N} \sum_{j=1}^{N} \left(1 - \Pr(u_j)\right).$$

In the long run, the typical sequences will dominate where the $i$-th outcome is sampled $p_i N$
times so that we have the value $1 - p_i$ occurring $p_i N$ times:

$$\lim_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} \left(1 - \Pr(u_j)\right) = \frac{1}{N} \sum_{i=1}^{n} p_i N (1 - p_i) = h(p).$$

The logical entropy $h(p) = \sum_i p_i (1 - p_i) = \sum_{i \neq j} p_i p_j$ is usually interpreted as the pair-drawing
probability of getting distinct outcomes from the distribution $p = (p_1, \ldots, p_n)$. Now we have a different
interpretation of logical entropy as the average probability of being different.
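
The "average probability of being different" reading can be illustrated by simulation. The sketch below samples a made-up three-outcome distribution with a fixed seed and compares the sample average of $1 - \Pr(u_j)$ with $h(p)$; the sample size and seed are arbitrary choices of this example.

```python
import random

def logical_entropy(p):
    """h(p) = sum of p_i * (1 - p_i)."""
    return sum(pi * (1 - pi) for pi in p)

def average_probability_of_difference(p, n_samples, seed=0):
    """Sample outcomes from p and average 1 - Pr(u_j) over the sampled sequence."""
    rng = random.Random(seed)
    outcomes = rng.choices(range(len(p)), weights=p, k=n_samples)
    return sum(1 - p[i] for i in outcomes) / n_samples

p = [0.5, 0.3, 0.2]   # hypothetical distribution
print(logical_entropy(p))
print(average_probability_of_difference(p, 100_000))
```

With a long sample the two printed values should agree to within sampling error, in line with the law of large numbers argument.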

2.4 A brief history of the logical entropy formula 

The logical entropy formula $h(p) = \sum_i p_i (1 - p_i) = 1 - \sum_i p_i^2$ is the probability of getting distinct
values $u_i \neq u_j$ in two independent samplings of the random variable $u$. The complementary measure
$1 - h(p) = \sum_i p_i^2$ is the probability that the two drawings yield the same value from $U$. Thus
$1 - \sum_i p_i^2$ is a measure of heterogeneity or diversity in keeping with our theme of information as
distinctions, while the complementary measure $\sum_i p_i^2$ is a measure of homogeneity or concentration.
Historically, the formula can be found in either form depending on the particular context. The $p_i$'s
might be relative shares such as the relative share of organisms of the $i$-th species in some population
of organisms, and then the interpretation of $p_i$ as a probability arises by considering the random
choice of an organism from the population.

According to I. J. Good, the formula has a certain naturalness: 

If $p_1, \ldots, p_t$ are the probabilities of $t$ mutually exclusive and exhaustive events, any
statistician of this century who wanted a measure of homogeneity would have taken
about two seconds to suggest $\sum p_i^2$ which I shall call $\rho$. [14, p. 561]

As noted by Bhargava and Uppuluri [3], the formula $1 - \sum p_i^2$ was used by Gini in 1912 ([10]
reprinted in [11, p. 369]) as a measure of "mutability" or diversity. But another development of
the formula (in the complementary form) in the early twentieth century was in cryptography. The
American cryptologist, William F. Friedman, devoted a 1922 book ([9]) to the "index of coincidence"
(i.e., $\sum p_i^2$). Solomon Kullback (of the Kullback-Leibler divergence treated later) worked as an
assistant to Friedman and wrote a book on cryptology which used the index. [19]

During World War II, Alan M. Turing worked for a time in the Government Code and Cypher 
School at the Bletchley Park facility in England. Probably unaware of the earlier work, Turing 
used $\rho = \sum p_i^2$ in his cryptanalysis work and called it the repeat rate since it is the probability
of a repeat in a pair of independent draws from a population with those probabilities (i.e., the
identification probability $1 - h(p)$). Polish cryptanalysts had independently used the repeat rate
in their work on the Enigma [25].

After the war, Edward H. Simpson, a British statistician, proposed $\sum_{B \in \pi} p_B^2$ as a measure
of species concentration (the opposite of diversity) where $\pi$ is the partition of animals or plants
according to species and where each animal or plant is considered as equiprobable. And Simpson
gave the interpretation of this homogeneity measure as "the probability that two individuals chosen
at random and independently from the population will be found to belong to the same group." [29,
p. 688] Hence $1 - \sum_{B \in \pi} p_B^2$ is the probability that a random ordered pair will belong to different
species, i.e., will be distinguished by the species partition. In the biodiversity literature [26], the
formula is known as "Simpson's index of diversity" or sometimes, the Gini-Simpson index [23].
However, Simpson along with I. J. Good worked at Bletchley Park during WWII, and, according to
Good, "E. H. Simpson and I both obtained the notion [the repeat rate] from Turing." [13, p. 395]
When Simpson published the index in 1948, he (again, according to Good) did not acknowledge
Turing "fearing that to acknowledge him would be regarded as a breach of security." [14, p. 562]



In 1945, Albert O. Hirschman ([17, p. 159] and [18]) suggested using $\sqrt{\sum p_i^2}$ as an index of
trade concentration (where $p_i$ is the relative share of trade in a certain commodity or with a certain
partner). A few years later, Orris Herfindahl [16] independently suggested using $\sum p_i^2$ as an index
of industrial concentration (where $p_i$ is the relative share of the $i$-th firm in an industry). In the
industrial economics literature, the index $H = \sum p_i^2$ is variously called the Hirschman-Herfindahl
index, the HH index, or just the H index of concentration. If all the relative shares were equal
(i.e., $p_i = 1/n$), then the identification or repeat probability is just the probability of drawing any
element, i.e., $H = 1/n$, so $\frac{1}{H} = n$ is the number of equal elements. This led to the "numbers
equivalent" interpretation of the reciprocal of the H index [2]. In general, given an event with
probability $p_0$, the numbers-equivalent interpretation of the event is that it is 'as if' an element was
drawn out of a set $U_{1/p_0}$ of $\frac{1}{p_0}$ equiprobable elements (it is 'as if' since $1/p_0$ need not be an integer).
This interpretation will be used later in the dit-bit connection.

In view of the frequent and independent discovery and rediscovery of the formula $\rho = \sum p_i^2$ or
its complement $1 - \sum p_i^2$ by Gini, Friedman, Turing, Hirschman, Herfindahl, and no doubt others,
I. J. Good wisely advises that "it is unjust to associate $\rho$ with any one person." [14, p. 562]

Two elements from $U = \{u_1, \ldots, u_n\}$ are either identical or distinct. Gini [10] introduced $d_{ij}$ as
the "distance" between the $i$-th and $j$-th elements where $d_{ij} = 1$ for $i \neq j$ and $d_{ii} = 0$. Since $1 =
(p_1 + \ldots + p_n)(p_1 + \ldots + p_n) = \sum_i p_i^2 + \sum_{i \neq j} p_i p_j$, the logical entropy, i.e., Gini's index of mutability,
$h(p) = 1 - \sum_i p_i^2 = \sum_{i \neq j} p_i p_j$, is the average logical distance between a pair of independently drawn
elements. But one might generalize by allowing other distances $d_{ij} = d_{ji}$ for $i \neq j$ (but always
$d_{ii} = 0$) so that $Q = \sum_{i \neq j} d_{ij} p_i p_j$ would be the average distance between a pair of independently
drawn elements from $U$. In 1982, C. R. (Calyampudi Radhakrishna) Rao introduced precisely this
concept as quadratic entropy [23]. In many domains, it is quite reasonable to move beyond the
bare-bones logical distance of $d_{ij} = 1$ for $i \neq j$ (i.e., the complement $1 - \delta_{ij}$ of the Kronecker delta)
so that Rao's quadratic entropy is a useful and easily interpreted generalization of logical entropy.⁴
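
The relation between quadratic entropy and logical entropy can be sketched as follows. The distribution and the second distance matrix (positions on a line) are hypothetical choices for illustration only; with the logical distance $d_{ij} = 1$ for $i \neq j$, the value of $Q$ reduces to $h(p)$.

```python
def quadratic_entropy(p, distance):
    """Q = sum over i != j of d_ij * p_i * p_j: average distance between two independent draws."""
    n = len(p)
    return sum(distance[i][j] * p[i] * p[j] for i in range(n) for j in range(n) if i != j)

def logical_entropy(p):
    """h(p) = 1 - sum of p_i squared."""
    return 1 - sum(pi * pi for pi in p)

p = [0.5, 0.25, 0.25]   # hypothetical distribution

# Logical distance: 1 off the diagonal, 0 on it.
logical_distance = [[0 if i == j else 1 for j in range(3)] for i in range(3)]
# A hypothetical symmetric distance: outcomes placed at positions 0, 1, 2 on a line.
line_distance = [[abs(i - j) for j in range(3)] for i in range(3)]

print(quadratic_entropy(p, logical_distance), logical_entropy(p))
print(quadratic_entropy(p, line_distance))
```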

3 Shannon Entropy 

3.1 Shannon-Hartley entropy of a set 

The Shannon entropy will first be motivated in the usual fashion and then developed from the basic 
logical notion of entropy. Shannon, like Ralph Hartley [15] before him, starts with the question of
how much "information" is required to single out a designated element from a set $U$ of equiprobable
elements. This is often formulated in terms of the search [24] for a hidden element like the answer
in a Twenty Questions game or the sent message in a communication. But being able to always find 
the designated element is equivalent to being able to distinguish all elements from one another. That 
is, if the designated element was in a set of two or more elements that had not been distinguished 
from one another, then one would not be able to single out the designated element. Thus "singling 
out" or "identifying" an element in a set is just another way to conceptualize "distinguishing" all 
the elements of the set. 

Intuitively, one might measure "information" as the minimum number of yes-or-no questions in
a game of Twenty Questions that it would take in general to distinguish all the possible "answers"
(or "messages" in the context of communications). This is readily seen in the simple case where
$|U| = 2^m$, i.e., the size of the set of equiprobable elements is a power of 2. Then following the lead
of Wilkins over three centuries earlier, the $2^m$ elements could be encoded using words of length $m$
in a binary code such as the digits $\{0, 1\}$ of binary arithmetic (or $\{A, B\}$ in the case of Wilkins).
Then an efficient or minimum set of yes-or-no questions needed to single out the hidden element is
the set of $m$ questions:

"Is the $j$-th digit in the binary code for the hidden element a 1?"

for $j = 1, \ldots, m$. Each element is distinguished from any other element by their binary codes differing
in at least one digit. The information gained in finding the outcome of an equiprobable binary trial,
like flipping a fair coin, is what Shannon calls a bit (derived from "binary digit"). Hence the
information gained in distinguishing all the elements out of $2^m$ equiprobable elements is:

$$m = \log_2(2^m) = \log_2(|U|) = \log_2\left(\frac{1}{p_0}\right) \text{ bits}$$

where $p_0 = \frac{1}{2^m}$ is the probability of any given element (henceforth all logs to base 2).

This is usefully restated in terms of partitions. Given two partitions $\pi = \{B\}$ and $\sigma = \{C\}$ of $U$,
their join $\pi \vee \sigma$ is the partition of $U$ whose blocks are the non-empty intersections $B \cap C$ for $B \in \pi$
and $C \in \sigma$. The determination of the $j$-th digit in the binary code for the hidden element defines a
binary partition $\pi_j$ of $U$. Then to say that the answers to the $m$ questions above distinguish all the
elements means that the join, $\bigvee_{j=1}^{m} \pi_j = \mathbf{1}$, is the discrete partition on the set $U$ with cardinality
$2^m$. Thus we could also take $m = \log\left(\frac{1}{p_0}\right)$ as the minimum number of binary partitions necessary
to distinguish the elements (i.e., to single out any given element).

4 Rao's treatment also includes (and generalizes) the natural extension to continuous (square-integrable) probability
density functions $f(x)$: $h(f) = 1 - \int f(x)^2 \, dx$.

In the more general case where $|U| = n$ is not a power of 2, we extrapolate to the definition of
$H(p_0)$ where $p_0 = \frac{1}{n}$ as:

$$H(p_0) = \log\left(\frac{1}{p_0}\right) = \log(n)$$
Shannon-Hartley entropy for an equiprobable set $U$ of $n$ elements.

The definition is further extrapolated to the case where we are only given a probability $p_0$ so that
we say that $H(p_0) = \log\left(\frac{1}{p_0}\right)$ binary partitions are needed to distinguish a set of $\frac{1}{p_0}$ elements when
$\frac{1}{p_0}$ is not an integer.
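
The statement that the join of the $m$ binary-digit partitions is the discrete partition can be checked with a short sketch. It identifies the $2^m$ elements with the integers $0, \ldots, 2^m - 1$ and numbers the digits from the least significant one; both are conventions of this example, not of the text.

```python
from math import log2

def binary_partition(universe, j):
    """Split `universe` by the j-th binary digit (0 = least significant) of each element."""
    ones = {u for u in universe if (u >> j) & 1}
    return [block for block in (set(universe) - ones, ones) if block]

def join(pi, sigma):
    """Blocks of the join are the non-empty intersections of blocks."""
    return [b & c for b in pi for c in sigma if b & c]

m = 3
U = range(2 ** m)

joined = [set(U)]                       # start from the indiscrete partition
for j in range(m):
    joined = join(joined, binary_partition(U, j))

# After all m questions every block is a singleton, and m = log2(|U|).
print(sorted(len(block) for block in joined), log2(len(U)))
```

Stopping after fewer than $m$ questions leaves blocks with more than one element, so some pairs remain undistinguished.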

3.2 Shannon entropy of a probability distribution 

This interpretation of the special case of $2^m$ or more generally $1/p_0$ equiprobable elements is
extended to an arbitrary finite probability distribution $p = (p_1, \ldots, p_n)$ by an averaging process. For
the $i$-th outcome ($i = 1, \ldots, n$), its probability $p_i$ is "as if" it were drawn from a set of $\frac{1}{p_i}$ equiprobable
elements (ignoring that $\frac{1}{p_i}$ may not be an integer for this averaging argument) so the
Shannon-Hartley information content of distinguishing the equiprobable elements of such a set would be
$\log\left(\frac{1}{p_i}\right)$. But that occurs with probability $p_i$ so the probabilistic average gives the usual definition
of the:

$$H(p) = \sum_{i=1}^{n} p_i H(p_i) = \sum_{i=1}^{n} p_i \log\left(\frac{1}{p_i}\right) = -\sum_{i=1}^{n} p_i \log(p_i)$$
Shannon entropy of a finite probability distribution $p$.

For the uniform distribution $p_i = \frac{1}{n}$, the Shannon entropy has its maximum value of $\log(n)$
while the minimum value is $0$ for the trivial distribution $p = (1, 0, \ldots, 0)$ so that:

$$0 \leq H(p) \leq \log(n).$$
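
A minimal sketch of the formula and its bounds follows. It assumes the usual convention that zero-probability outcomes contribute nothing to the sum, which the text does not spell out; the example distributions are made up.

```python
from math import log2

def shannon_entropy(p):
    """H(p) = sum of p_i * log2(1/p_i), skipping outcomes with p_i = 0 (assumed convention)."""
    return sum(pi * log2(1 / pi) for pi in p if pi > 0)

print(shannon_entropy([0.5, 0.25, 0.25]))   # a non-uniform distribution
print(shannon_entropy([0.25] * 4))          # uniform on n = 4: the maximum log2(4)
print(shannon_entropy([1.0, 0.0, 0.0]))     # trivial distribution: the minimum
```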



3.3 A statistical treatment of Shannon entropy 

Shannon makes this averaging argument rigorous by using the law of large numbers. Suppose that
we have a three-letter alphabet $\{a, b, c\}$ where each letter was equiprobable, $p_a = p_b = p_c = \frac{1}{3}$, in a
multi-letter message. Then a one-letter or two-letter message cannot be exactly coded with a binary
0, 1 code with equiprobable 0's and 1's. But any probability can be better and better approximated
by longer and longer representations in the binary number system. Hence we can consider longer
and longer messages of $N$ letters along with better and better approximations with binary codes.
The long run behavior of messages $u_1 u_2 \ldots u_N$ where $u_i \in \{a, b, c\}$ is modeled by the law of large
numbers so that the letter $a$ will tend to occur $p_a N = \frac{1}{3} N$ times and similarly for $b$ and $c$. Such a
message is called typical.

The probability of any one of those typical messages is:

$$p_a^{p_a N} p_b^{p_b N} p_c^{p_c N} = \left[p_a^{p_a} p_b^{p_b} p_c^{p_c}\right]^N$$

or, in this case,

$$\left[\left(\tfrac{1}{3}\right)^{1/3} \left(\tfrac{1}{3}\right)^{1/3} \left(\tfrac{1}{3}\right)^{1/3}\right]^N = \left(\tfrac{1}{3}\right)^N.$$

Hence the number of such typical messages is $3^N$.

If each message was assigned a unique binary code, then the number of 0, 1's in the code would
have to be $X$ where $2^X = 3^N$ or $X = \log(3^N) = N \log(3)$. Hence the number of equiprobable
binary questions or bits needed per letter of the messages is:

$$N \log(3) / N = \log(3) = 3 \times \tfrac{1}{3} \log\left(\frac{1}{1/3}\right) = H(p).$$

This example shows the general pattern.

In the general case, let $p = (p_1, \ldots, p_n)$ be the probabilities over an $n$-letter alphabet $A =
\{a_1, \ldots, a_n\}$. In an $N$-letter message, the probability of a particular message $u_1 u_2 \ldots u_N$ is $\prod_{i=1}^{N} \Pr(u_i)$
where $u_i$ could be any of the symbols in the alphabet so if $u_i = a_j$ then $\Pr(u_i) = p_j$.

In a typical message, the $i$-th symbol will occur $p_i N$ times (law of large numbers) so the
probability of a typical message is (note change of indices to the letters of the alphabet):

$$\prod_{k=1}^{n} p_k^{p_k N} = \left[\prod_{k=1}^{n} p_k^{p_k}\right]^N.$$

Since the probability of a typical message is $P^N$ for $P = \prod_{k=1}^{n} p_k^{p_k}$, the typical messages are
equiprobable. Hence the number of typical messages is $\left[\prod_{k=1}^{n} p_k^{-p_k}\right]^N$ and assigning a unique binary
code to each typical message requires $X$ bits where $2^X = \left[\prod_{k=1}^{n} p_k^{-p_k}\right]^N$, where:

$$X = \log\left\{\left[\prod_{k=1}^{n} p_k^{-p_k}\right]^N\right\} = N \log\left[\prod_{k=1}^{n} p_k^{-p_k}\right] = N \sum_{k=1}^{n} \log\left(p_k^{-p_k}\right) = N \sum_{k} -p_k \log(p_k) = N \sum_{k} p_k \log\left(\frac{1}{p_k}\right) = N H(p).$$



Hence the Shannon entropy $H(p) = \sum_{k=1}^{n} p_k\log\left(\frac{1}{p_k}\right)$ is interpreted as the limiting average
number of bits necessary per letter in the message. In terms of distinctions, this is the average
number of binary partitions necessary per letter to distinguish the messages. It is this averaging
result that allows us to consider "the number of binary partitions it takes to distinguish the elements
of $U$" when $|U|$ is not a power of 2 since "number" is interpreted as "average number."
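The averaging argument can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms and treating the typical messages as exactly equiprobable, as the argument above does; the function names and the second (non-uniform) alphabet are illustrative assumptions, not part of the original text.

```python
import math

def shannon_entropy(p):
    """H(p) = sum_k p_k * log2(1/p_k), in bits."""
    return sum(pk * math.log2(1.0 / pk) for pk in p if pk > 0)

def bits_per_letter(p, n_letters):
    """Bits per letter needed to index the typical messages of length N.

    A typical message has probability P^N with P = prod_k p_k^(p_k); the
    typical messages are treated as equiprobable, so there are 1 / P^N of them.
    """
    typical_probability = math.prod(pk ** (pk * n_letters) for pk in p if pk > 0)
    x_bits = math.log2(1.0 / typical_probability)  # solves 2^X = 1 / P^N
    return x_bits / n_letters

uniform = [1 / 3, 1 / 3, 1 / 3]   # the three-letter example from the text
skewed = [0.5, 0.25, 0.25]        # hypothetical non-uniform alphabet

print(bits_per_letter(uniform, 300), shannon_entropy(uniform))
print(bits_per_letter(skewed, 8), shannon_entropy(skewed))
```

In both cases the bits-per-letter figure agrees with $H(p)$ up to floating-point error, independently of the message length chosen.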



3.4 Shannon entropy of a partition 

Shannon entropy can also be defined for a partition $\pi = \{B\}$ on a set $U$. If the elements of $U$
are equiprobable, then the probability that a randomly drawn element is in a block $B \in \pi$ is
$p_B = \frac{|B|}{|U|}$. In a set of $\frac{1}{p_B}$ equiprobable elements, it would take (on average) $H(p_B) = \log\left(\frac{1}{p_B}\right)$
binary partitions to distinguish the elements. Averaging over the blocks, we have the:

$$H(\pi) = \sum_{B \in \pi} p_B\log\left(\frac{1}{p_B}\right)$$
Shannon entropy of a partition $\pi$.

3.5 Shannon entropy and statistical mechanics

The functional form of Shannon's formula is often further "justified" or "motivated" by asserting 
that it is the same as the notion of entropy in statistical mechanics, and hence the name "entropy." 
The name "entropy" is here to stay but the justification of the formula by reference to statistical 
mechanics is not quite correct. The connection between entropy in statistical mechanics and Shan- 
non's entropy is only via a numerical approximation, the Stirling approximation, where if the first 
two terms in the Stirling approximation are used, then the Shannon formula is obtained. 

The first two terms in the Stirling approximation for $\ln(N!)$ are: $\ln(N!) \approx N\ln(N) - N$. The
first three terms in the Stirling approximation are: $\ln(N!) \approx N(\ln(N) - 1) + \frac{1}{2}\ln(2\pi N)$.

If we consider a partition on a finite $U$ with $|U| = N$, with $n$ blocks of size $N_1, \ldots, N_n$, then the
number of ways of distributing the individuals in these $n$ boxes with those numbers $N_i$ in the $i$-th
box is: $W = \frac{N!}{N_1! \times \cdots \times N_n!}$. The normalized natural log of $W$, $S = \frac{1}{N}\ln(W)$, is one form of entropy in
statistical mechanics. Indeed, the formula "$S = k\log(W)$" is engraved on Boltzmann's tombstone.

The entropy formula can then be developed using the first two terms in the Stirling approxi- 
mation. 

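$$\begin{aligned}
S = \frac{1}{N}\ln(W) &= \frac{1}{N}\ln\left(\frac{N!}{N_1! \times \cdots \times N_n!}\right) = \frac{1}{N}\left[\ln(N!) - \sum_i \ln(N_i!)\right] \\
&\approx \frac{1}{N}\left[N[\ln(N) - 1] - \sum_i N_i[\ln(N_i) - 1]\right] \\
&= \frac{1}{N}\left[N\ln(N) - \sum_i N_i\ln(N_i)\right] = \frac{1}{N}\left[\sum_i N_i\ln(N) - \sum_i N_i\ln(N_i)\right] \\
&= \sum_i \frac{N_i}{N}\ln\left(\frac{N}{N_i}\right) = \sum_i p_i\ln\left(\frac{1}{p_i}\right) = H_e(p)
\end{aligned}$$

where $p_i = \frac{N_i}{N}$ (and where the formula with logs to the base $e$ only differs from the usual base 2
formula by a scaling factor). Shannon's entropy $H_e(p)$ is in fact an excellent numerical approximation
to $S = \frac{1}{N}\ln(W)$ for large $N$ (e.g., in statistical mechanics).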

But the common claim is that Shannon's entropy has the same functional form as entropy in 
statistical mechanics, and that is simply false. If we use a three-term Stirling approximation, then 
we obtain an even better numerical approximation! 

$$S = \frac{1}{N}\ln(W) \approx H_e(p) + \frac{1}{2N}\ln\left(\frac{2\pi N}{(2\pi)^n \prod_i N_i}\right)$$

but no one would suggest using that " more accurate" entropy formula in information theory. Shan- 
non's formula should be justified and understood by the arguments given previously, and not by 
over-interpreting the approximate relationship with entropy in statistical mechanics. 
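The numerical relationship between $S = \frac{1}{N}\ln(W)$, the two-term Stirling result $H_e(p)$, and the three-term refinement can be illustrated with a small sketch. The block sizes below are a hypothetical example, and `math.lgamma(k + 1)` is used for $\ln(k!)$; the correction term is simply what the extra $\frac{1}{2}\ln(2\pi N)$ Stirling terms contribute.

```python
import math

def boltzmann_entropy(block_sizes):
    """S = (1/N) ln W with W = N! / (N_1! ... N_n!), using lgamma(k+1) = ln(k!)."""
    n_total = sum(block_sizes)
    ln_w = math.lgamma(n_total + 1) - sum(math.lgamma(k + 1) for k in block_sizes)
    return ln_w / n_total

def shannon_entropy_nats(block_sizes):
    """H_e(p) = sum_i p_i ln(1/p_i) with p_i = N_i / N."""
    n_total = sum(block_sizes)
    return sum((k / n_total) * math.log(n_total / k) for k in block_sizes if k > 0)

def three_term_estimate(block_sizes):
    """H_e(p) plus the contribution of the (1/2) ln(2 pi k) Stirling terms."""
    n_total = sum(block_sizes)
    correction = 0.5 * (math.log(2 * math.pi * n_total)
                        - sum(math.log(2 * math.pi * k) for k in block_sizes))
    return shannon_entropy_nats(block_sizes) + correction / n_total

sizes = [500, 300, 200]  # hypothetical block sizes N_1, N_2, N_3 with N = 1000
s_exact = boltzmann_entropy(sizes)
print(s_exact, shannon_entropy_nats(sizes), three_term_estimate(sizes))
```

For these sizes the Shannon value is close to $S$, and the three-term estimate is closer still, which is the point made in the text: closeness of approximation is not sameness of functional form.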

3.6 The basic dit-bit connection 

The basic datum is "the" set $U_n$ of $n$ elements with the equal probabilities $p_0 = \frac{1}{n}$. In that basic
case of an equiprobable set, we can derive the dit-bit connection, and then by using a probabilistic
average, we can develop the Shannon entropy, expressed in terms of bits, from the logical entropy,
expressed in terms of (normalized) dits, or vice-versa.

Given $U_n$ with $n$ equiprobable elements, the number of dits (of the discrete partition on $U_n$) is
$n^2 - n$ so the normalized dit count is:

$$h(p_0) = h\left(\tfrac{1}{n}\right) = 1 - p_0 = 1 - \tfrac{1}{n} \text{ normalized dits.}$$

That is the dit-count or logical measure of the information in a set of $n$ distinct elements.

But we can also measure the information in the set by the number of binary partitions it takes
(on average) to distinguish the elements, and that bit-count is:

$$H(p_0) = H\left(\tfrac{1}{n}\right) = \log\left(\tfrac{1}{p_0}\right) = \log(n) \text{ bits.}$$

5 For the case $n = 2$, MacKay [21, p. 2] also uses Stirling's approximation to give a "more accurate approximation"
(using the next term in the Stirling approximation) to the entropy of statistical mechanics than the Shannon entropy.



By solving the dit-count and the bit-count for $p_0$ and equating, we can derive each measure in
terms of the other:

$$H(p_0) = \log\left(\frac{1}{1 - h(p_0)}\right) \quad\text{and}\quad h(p_0) = 1 - \frac{1}{2^{H(p_0)}}$$
The dit-bit conversion formulas.

The common thing being measured is an equiprobable $U_n$ where $n = \frac{1}{p_0}$. The dit-count for $U_n$
is $h(p_0) = 1 - \frac{1}{n}$ and the bit-count for $U_n$ is $H(p_0) = \log\left(\frac{1}{p_0}\right)$, and the dit-bit connection gives the
relationship between the two counts. Using this dit-bit connection between the two different ways
to measure the "information" in $U_n$, each entropy can be developed from the other.
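As a small illustration of the conversion formulas, the sketch below computes both counts for an equiprobable set and converts each into the other. Base-2 logarithms are assumed, and the function names are ours rather than the paper's.

```python
import math

def dit_count(p0):
    """Normalized dit count h(p0) = 1 - p0 of an equiprobable set with 1/p0 elements."""
    return 1.0 - p0

def bit_count(p0):
    """Bit count H(p0) = log2(1/p0) of the same equiprobable set."""
    return math.log2(1.0 / p0)

def bits_from_dits(h):
    """H = log2(1 / (1 - h))."""
    return math.log2(1.0 / (1.0 - h))

def dits_from_bits(big_h):
    """h = 1 - 1 / 2^H."""
    return 1.0 - 1.0 / (2.0 ** big_h)

n = 8
p0 = 1 / n
print(dit_count(p0), bit_count(p0))
print(bits_from_dits(dit_count(p0)), dits_from_bits(bit_count(p0)))
```

The two conversions are inverse to each other, so either count determines the other for an equiprobable set.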

We start with the logical entropy of a probability distribution $p = (p_1, \ldots, p_n)$: $h(p) = \sum_{i=1}^{n} p_i h(p_i)$,
where $h(p_i) = 1 - p_i$. It is expressed as the probabilistic average of the dit-counts or logical entropies
of the sets $U_{1/p_i}$ with $\frac{1}{p_i}$ equiprobable elements (footnote 7). But if we switch to the binary-partition
bit-counts of the information content of those same sets $U_{1/p_i}$ of $\frac{1}{p_i}$ equiprobable elements, then the
bit-counts are $H(p_i) = \log\left(\frac{1}{p_i}\right)$ and the probabilistic average is the Shannon entropy:
$H(p) = \sum_{i=1}^{n} p_i H(p_i)$. Both entropies have the mathematical form:

$$\sum_i p_i \left(\text{measure of info. in } U_{1/p_i}\right)$$

and differ by using either the dit-count or bit-count to measure the information in $U_{1/p_i}$.

Clearly the process is reversible, so one can use the dit-bit connection in reverse to develop the 
logical entropy h (p) from the Shannon entropy H (p) . Thus the two notions of entropy are simply 
two different ways, using distinctions (dit-counts) or binary partitions (bit-counts), to measure the 
information in a probability distribution. 
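The shared form of the two entropies can be written as a single averaging routine that takes the information measure as a parameter. This is only a sketch: `average_info` and the example distribution are our own illustrative choices, with base-2 logarithms assumed.

```python
import math

def average_info(p, measure):
    """sum_i p_i * measure(p_i): the form shared by both entropies."""
    return sum(pi * measure(pi) for pi in p if pi > 0)

def logical_entropy(p):
    """h(p): average of the dit-counts 1 - p_i."""
    return average_info(p, lambda pi: 1.0 - pi)

def shannon_entropy(p):
    """H(p): average of the bit-counts log2(1/p_i)."""
    return average_info(p, lambda pi: math.log2(1.0 / pi))

p = [0.5, 0.25, 0.25]  # hypothetical distribution
print(logical_entropy(p), shannon_entropy(p))
```

Only the `measure` argument changes between the two functions, which mirrors the claim that the entropies differ solely in using dit-counts or bit-counts.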

Moreover the dit-bit connection carries over to the compound notions of entropy so that the 
Shannon notions of conditional entropy, mutual information, and joint entropy can be developed 
from the corresponding notions for logical entropy. Since the logical notions are the values of 
a probability measure, the compound notions of logical entropy have the usual Venn diagram 
relations such as the inclusion-exclusion principle. There is a well-known analogy between the 
"Venn diagram" relationships for the Shannon entropies and the relationships satisfied by any 
measure on a set ([1], [5]). As L. L. Campbell puts it, the analogy:

suggests the possibility that $H(\alpha)$ and $H(\beta)$ are measures of sets, that $H(\alpha, \beta)$ is
the measure of their union, that $I(\alpha, \beta)$ is the measure of their intersection, and that
$H(\alpha|\beta)$ is the measure of their difference. The possibility that $I(\alpha, \beta)$ is the entropy
of the "intersection" of two partitions is particularly interesting. This "intersection," if
it existed, would presumably contain the information common to the partitions $\alpha$ and
$\beta$. [5, p. 113]

6 The context will determine whether "dit-count" refers to the "raw" count $|\operatorname{dit}(\pi)|$ or the normalized count $\frac{|\operatorname{dit}(\pi)|}{|U \times U|}$.
7 Starting with the datum of the probability $p_i$, there is no necessity that $n = \frac{1}{p_i}$ is an integer so the dit-counts for
$U_{1/p_i}$ are extrapolations while the bit-counts or binary partition counts for $U_n$ are already extrapolations even when
$n$ is an integer but not a power of 2.

All of Campbell's desiderata are precisely true when: 

• "sets" = dit sets, and 

• "entropies" = normalized counting measure of the (dit) sets, i.e., the logical entropies. 

Since the logical entropies are the values of a measure, by developing the corresponding Shannon 
notions from the logical ones, we have an explanation of why the Shannon notions also exhibit the 
same Venn diagram relationships. 

The expository strategy is to first develop the Shannon and logical compound notions of entropy 
separately and then to show the relationship using the dit-bit connection. 



4 Conditional entropies 
4.1 Logical conditional entropy 

Given two partitions $\pi = \{B\}$ and $\sigma = \{C\}$ on a finite set $U$, how might one measure the new
information that is provided by $\pi$ that was not already in $\sigma$? Campbell suggests associating sets with
partitions so the conditional entropy would be the measure of the difference between the sets. Taking
the information as distinctions, we take the difference between the dit sets, i.e., $\operatorname{dit}(\pi) - \operatorname{dit}(\sigma)$,
and then take the normalized counting measure of that subset $\operatorname{dit}(\pi) - \operatorname{dit}(\sigma) \subseteq U \times U$:

$$h(\pi|\sigma) = \frac{|\operatorname{dit}(\pi) - \operatorname{dit}(\sigma)|}{|U|^2}$$
Logical conditional entropy of $\pi$ given $\sigma$.



When the two partitions $\pi$ and $\sigma$ are joined together in the join $\pi \vee \sigma$, whose blocks are the
non-empty intersections $B \cap C$, their information as distinctions is also joined together as sets,
$\operatorname{dit}(\pi \vee \sigma) = \operatorname{dit}(\pi) \cup \operatorname{dit}(\sigma)$ (the "union" mentioned by Campbell), which has the normalized
counting measure of:

$$h(\pi \vee \sigma) = \frac{|\operatorname{dit}(\pi) \cup \operatorname{dit}(\sigma)|}{|U|^2} = \sum_{B \in \pi, C \in \sigma} p_{B \cap C}\left[1 - p_{B \cap C}\right]$$
logical entropy of a partition join $\pi \vee \sigma$.

This logical entropy is interpreted as the probability that a pair of random draws from $U$ will yield
a $\pi$-distinction or a $\sigma$-distinction (where "or" includes both).

Then the relationships between the logical entropy concepts can be read off the Venn diagram
inclusion-exclusion principle for the dit sets:

$$|\operatorname{dit}(\pi)| + |\operatorname{dit}(\sigma)| = |\operatorname{dit}(\pi \vee \sigma)| + |\operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)|$$

so that

$$|\operatorname{dit}(\pi) - \operatorname{dit}(\sigma)| = |\operatorname{dit}(\pi)| - |\operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)| = |\operatorname{dit}(\pi \vee \sigma)| - |\operatorname{dit}(\sigma)|.$$




Figure 1: Venn diagram for subsets of U x U 

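The shaded area in the Venn diagram has the dit-count measure:

$$\begin{aligned}
|\operatorname{dit}(\pi) - \operatorname{dit}(\sigma)| &= |\operatorname{dit}(\pi) \cup \operatorname{dit}(\sigma)| - |\operatorname{dit}(\sigma)| \\
h(\pi|\sigma) &= h(\pi \vee \sigma) - h(\sigma).
\end{aligned}$$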
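The dit-set definitions are easy to compute directly on a small universe. The sketch below represents a partition as a list of Python sets and uses exact fractions; the two partitions of a six-element set are a hypothetical example chosen only to exercise the identity $h(\pi|\sigma) = h(\pi \vee \sigma) - h(\sigma)$.

```python
from fractions import Fraction
from itertools import product

def dit_set(partition):
    """Ordered pairs (u, v) whose elements lie in distinct blocks."""
    block_of = {u: i for i, block in enumerate(partition) for u in block}
    universe = sorted(block_of)
    return {(u, v) for u, v in product(universe, repeat=2)
            if block_of[u] != block_of[v]}

def join(pi, sigma):
    """Blocks of the join: the non-empty intersections of a block of pi with a block of sigma."""
    return [b & c for b in pi for c in sigma if b & c]

def logical_entropy(partition):
    """Normalized counting measure |dit(partition)| / |U|^2."""
    n = sum(len(block) for block in partition)
    return Fraction(len(dit_set(partition)), n * n)

def logical_conditional_entropy(pi, sigma):
    """Normalized counting measure of dit(pi) - dit(sigma)."""
    n = sum(len(block) for block in pi)
    return Fraction(len(dit_set(pi) - dit_set(sigma)), n * n)

pi = [{0, 1, 2}, {3, 4, 5}]
sigma = [{0, 1}, {2, 3}, {4, 5}]

print(logical_conditional_entropy(pi, sigma))
print(logical_entropy(join(pi, sigma)) - logical_entropy(sigma))
```

Because the entropies are computed as set measures, the identity holds exactly rather than up to rounding.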

For the corresponding definitions for random variables and their probability distributions,
consider a random variable $(x, y)$ taking values in the product $X \times Y$ of finite sets with the joint
probability distribution $p(x, y)$, and thus with the marginal distributions: $p(x) = \sum_{y \in Y} p(x, y)$
and $p(y) = \sum_{x \in X} p(x, y)$. For notational simplicity, the entropies can be considered as functions of
the random variables or of their probability distributions, e.g., $h(p(x, y)) = h(x, y)$. For the joint
distribution, we have the:

$$h(x, y) = h(p(x, y)) = \sum_{x \in X, y \in Y} p(x, y)\left[1 - p(x, y)\right]$$
logical entropy of the joint distribution

which is the probability that two samplings of the joint distribution will yield a pair of distinct
ordered pairs $(x, y), (x', y') \in X \times Y$, i.e., with an $X$-distinction $x \neq x'$ or a $Y$-distinction $y \neq y'$.

For the definition of the conditional entropy $h(x|y)$, we simply take the product measure of
the set of pairs $(x, y)$ and $(x', y')$ that give an $X$-distinction but not a $Y$-distinction. Thus given
the first draw $(x, y)$, we can again use a Venn diagram to compute the probability that the second
draw $(x', y')$ will have $x' \neq x$ but $y' = y$.

To illustrate this using Venn diagram reasoning, consider the probability measure defined by
$p(x, y)$ on the subsets of $X \times Y$. Given the first draw $(x, y)$, the probability of getting an
$(x, y)$-distinction on the second draw is $1 - p(x, y)$ and the probability of getting a $y$-distinction is $1 - p(y)$.
A draw that is a $y$-distinction is, a fortiori, an $(x, y)$-distinction so the area $1 - p(y)$ is contained in
the area $1 - p(x, y)$. Then the probability of getting an $(x, y)$-distinction that is not a $y$-distinction
on the second draw is the difference: $(1 - p(x, y)) - (1 - p(y)) = p(y) - p(x, y)$.

Figure 2: $(1 - p(x, y)) - (1 - p(y))$
= probability of an $x$-distinction but not a $y$-distinction on $X \times Y$.

Since the first draw $(x, y)$ was with probability $p(x, y)$, we have the following as the product measure
of the subset of $[X \times Y]^2$ of pairs $[(x, y), (x', y')]$ that are $X$-distinctions but not $Y$-distinctions:

$$h(x|y) = \sum_{x,y} p(x, y)\left[(1 - p(x, y)) - (1 - p(y))\right]$$
logical conditional entropy of $x$ given $y$.

Then a little algebra quickly yields:

$$\begin{aligned}
h(x|y) &= \sum_{x,y} p(x, y)\left[(1 - p(x, y)) - (1 - p(y))\right] \\
&= \left[1 - \sum_{x,y} p(x, y)^2\right] - \left[1 - \sum_{y} p(y)^2\right] = h(x, y) - h(y).
\end{aligned}$$

The summation over $p(x, y)$ recasts the Venn diagram to the set $(X \times Y)^2$ where the product
probability measure (for the two independent draws) gives the logical entropies:

Figure 3: $h(x|y) = h(x, y) - h(y)$.
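The claim that $h(x|y)$ is a product probability measure, and not only an average, can be checked by brute force over pairs of draws. The joint distribution below is a hypothetical three-point example, and exact fractions are used so that the three computations agree exactly.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution p(x, y), chosen only for illustration.
p_xy = {("a", 0): Fraction(1, 2), ("a", 1): Fraction(1, 4), ("b", 1): Fraction(1, 4)}

def marginal_y(p_xy):
    """p(y) = sum_x p(x, y)."""
    p_y = {}
    for (_, y), prob in p_xy.items():
        p_y[y] = p_y.get(y, 0) + prob
    return p_y

def logical_entropy(dist):
    """h = 1 - sum of squared probabilities."""
    return 1 - sum(prob ** 2 for prob in dist.values())

def h_cond_formula(p_xy):
    """sum_{x,y} p(x,y) [(1 - p(x,y)) - (1 - p(y))]."""
    p_y = marginal_y(p_xy)
    return sum(prob * ((1 - prob) - (1 - p_y[y])) for (_, y), prob in p_xy.items())

def h_cond_product_measure(p_xy):
    """Product measure of the pairs of draws with x != x' and y == y'."""
    return sum(p1 * p2
               for ((x1, y1), p1), ((x2, y2), p2) in product(p_xy.items(), repeat=2)
               if x1 != x2 and y1 == y2)

print(h_cond_formula(p_xy), h_cond_product_measure(p_xy))
print(logical_entropy(p_xy) - logical_entropy(marginal_y(p_xy)))
```

All three expressions give the same value for this example, matching $h(x|y) = h(x,y) - h(y)$.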

It might be noted that the logical conditional entropy, like the other logical entropies, is not
just an average; the conditional entropy is the product probability measure of the subset:

$$\{[(x, y), (x', y')] : x \neq x', y = y'\} \subseteq (X \times Y) \times (X \times Y).$$

4.2 Shannon conditional entropy

The Shannon conditional entropy for partitions $\pi$ and $\sigma$ is based on subset reasoning which is then
averaged over a partition. Given a subset $C \in \sigma$, a partition $\pi = \{B\}_{B \in \pi}$ induces a partition of
$C$ with the blocks $\{B \cap C\}_{B \in \pi}$. Then $p_{B|C} = \frac{p_{B \cap C}}{p_C}$ is the probability distribution associated with
that partition so it has a Shannon entropy which we denote:

$$H(\pi|C) = \sum_{B \in \pi} p_{B|C}\log\left(\frac{1}{p_{B|C}}\right) = \sum_{B} \frac{p_{B \cap C}}{p_C}\log\left(\frac{p_C}{p_{B \cap C}}\right).$$

The Shannon conditional entropy is then obtained by averaging over the blocks of $\sigma$:

$$H(\pi|\sigma) = \sum_{C \in \sigma} p_C H(\pi|C) = \sum_{B,C} p_{B \cap C}\log\left(\frac{p_C}{p_{B \cap C}}\right)$$
Shannon conditional entropy of $\pi$ given $\sigma$.

Since the join $\pi \vee \sigma$ is the partition whose blocks are the non-empty intersections $B \cap C$,

$$H(\pi \vee \sigma) = \sum_{B,C} p_{B \cap C}\log\left(\frac{1}{p_{B \cap C}}\right).$$

Developing the formula gives:

$$H(\pi|\sigma) = \sum_{C}\left[p_C\log(p_C) - \sum_{B} p_{B \cap C}\log(p_{B \cap C})\right] = H(\pi \vee \sigma) - H(\sigma).$$

Thus the conditional entropy $H(\pi|\sigma)$ is interpreted as the Shannon-information contained in the
join $\pi \vee \sigma$ that is not contained in $\sigma$.
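The block-by-block averaging can be sketched in a few lines. As before, partitions are lists of Python sets over equiprobable points, logarithms are base 2, and the two partitions are a hypothetical example rather than one taken from the text.

```python
import math

def shannon_entropy(probs):
    """H = sum p * log2(1/p) over the non-zero probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def block_probs(partition):
    """Block probabilities p_B = |B| / |U| for equiprobable points."""
    n = sum(len(block) for block in partition)
    return [len(block) / n for block in partition]

def join(pi, sigma):
    """Blocks of the join: the non-empty intersections."""
    return [b & c for b in pi for c in sigma if b & c]

def shannon_conditional_entropy(pi, sigma):
    """H(pi|sigma) = sum_C p_C * H(pi|C), where p_{B|C} = |B & C| / |C|."""
    n = sum(len(c) for c in sigma)
    total = 0.0
    for c in sigma:
        induced = [len(b & c) / len(c) for b in pi if b & c]
        total += (len(c) / n) * shannon_entropy(induced)
    return total

pi = [{0, 1, 2}, {3, 4, 5}]
sigma = [{0, 1}, {2, 3}, {4, 5}]

averaged = shannon_conditional_entropy(pi, sigma)
by_difference = shannon_entropy(block_probs(join(pi, sigma))) - shannon_entropy(block_probs(sigma))
print(averaged, by_difference)
```

Only the block $\{2, 3\}$ of $\sigma$ is split by $\pi$, so only that block contributes to the average in this example.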



Figure 4: $H(\pi|\sigma) = H(\pi \vee \sigma) - H(\sigma)$
"Venn diagram picture" for Shannon conditional entropy of partitions

Given the joint distribution $p(x, y)$ on $X \times Y$, the conditional probability distribution for a
specific $y_0 \in Y$ is $p(x|y_0) = \frac{p(x, y_0)}{p(y_0)}$ which has the Shannon entropy: $H(x|y_0) = \sum_{x} p(x|y_0)\log\left(\frac{1}{p(x|y_0)}\right)$.
Then the conditional entropy is the average of these entropies:

$$H(x|y) = \sum_{y} p(y)\sum_{x} \frac{p(x, y)}{p(y)}\log\left(\frac{p(y)}{p(x, y)}\right) = \sum_{x,y} p(x, y)\log\left(\frac{p(y)}{p(x, y)}\right)$$
Shannon conditional entropy of $x$ given $y$.

Expanding as before gives $H(x|y) = H(x, y) - H(y)$ with a similar Venn diagram picture (see
below).

4.3 Shannon conditional entropy from logical conditional entropy



Now we can develop the Shannon conditional entropy from the logical conditional entropy and 
thereby explain the Venn diagram relationship. The logical conditional entropy is: 



$$h(x|y) = \sum_{x,y} p(x, y)\left[(1 - p(x, y)) - (1 - p(y))\right]$$

where $1 - p(x, y)$ is the normalized dit count for the discrete partition on a set $U_{1/p(x,y)}$ with $\frac{1}{p(x,y)}$
equiprobable elements. Hence that same equiprobable set requires the bit-count of $\log\left(\frac{1}{p(x,y)}\right)$
binary partitions to distinguish its elements. Similarly $1 - p(y)$ is the normalized dit count for
(the discrete partition on) a set $U_{1/p(y)}$ with $\frac{1}{p(y)}$ equiprobable elements, so it requires $\log\left(\frac{1}{p(y)}\right)$
binary partitions to make those distinctions. Those binary partitions are included in the $\log\left(\frac{1}{p(x,y)}\right)$
binary partitions (since a $y$-distinction is automatically an $(x, y)$-distinction) and we don't want the
$y$-distinctions so they are subtracted off to get: $\log\left(\frac{1}{p(x,y)}\right) - \log\left(\frac{1}{p(y)}\right)$ bits. Taking the same
probabilistic average, the average number of binary partitions needed to make the $x$-distinctions
but not the $y$-distinctions is:

$$\sum_{x,y} p(x, y)\left[\log\left(\frac{1}{p(x, y)}\right) - \log\left(\frac{1}{p(y)}\right)\right] = \sum_{x,y} p(x, y)\log\left(\frac{p(y)}{p(x, y)}\right) = H(x|y).$$

Replacing the dit-counts by the bit-counts for the equiprobable sets, and taking the probabilistic
average gives the same Venn diagram picture for the Shannon entropies.

Figure 5: $H(x|y) = H(x, y) - H(y)$
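The substitution of bit-counts for dit-counts can be made explicit by writing the conditional entropy once, with the information measure passed in as a function. This is a sketch under our own naming; the joint distribution is a hypothetical example and logarithms are base 2.

```python
import math

# Hypothetical joint distribution p(x, y), chosen only for illustration.
p_xy = {("a", 0): 0.5, ("a", 1): 0.25, ("b", 1): 0.25}

def marginal_y(p_xy):
    """p(y) = sum_x p(x, y)."""
    p_y = {}
    for (_, y), prob in p_xy.items():
        p_y[y] = p_y.get(y, 0.0) + prob
    return p_y

def conditional_entropy(p_xy, info):
    """sum_{x,y} p(x,y) * [info(p(x,y)) - info(p(y))]."""
    p_y = marginal_y(p_xy)
    return sum(prob * (info(prob) - info(p_y[y])) for (_, y), prob in p_xy.items())

def dit_count(prob):
    """Normalized dit count 1 - p of an equiprobable set with 1/p elements."""
    return 1.0 - prob

def bit_count(prob):
    """Bit count log2(1/p) of the same set."""
    return math.log2(1.0 / prob)

h_cond = conditional_entropy(p_xy, dit_count)  # logical h(x|y)
H_cond = conditional_entropy(p_xy, bit_count)  # Shannon H(x|y)
print(h_cond, H_cond)
```

The same template yields $h(x|y)$ or $H(x|y)$ depending only on which count is supplied.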



5 Mutual information for logical entropies 
5.1 The case for partitions 

If the "atom" of information is the distinction or dit, then the atomic information in a partition
$\pi$ is its dit set, $\operatorname{dit}(\pi)$. Following again Campbell's dictum about the mutual information, the
information common to two partitions $\pi$ and $\sigma$ would naturally be the intersection of their dit sets:

$$\operatorname{Mut}(\pi, \sigma) = \operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)$$
Mutual information set.

It is an interesting and not completely trivial fact that as long as neither $\pi$ nor $\sigma$ is the indiscrete
partition $0$ (where $\operatorname{dit}(0) = \emptyset$), then $\pi$ and $\sigma$ have a distinction in common.



Proposition 1 (Non-empty dit sets intersect) Given two partitions $\pi$ and $\sigma$ on $U$ with
non-empty dit sets, $\operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma) \neq \emptyset$ (footnote 8).

Since $\pi$ is not the indiscrete partition, consider two elements $u$ and $u'$ distinguished by $\pi$ but
identified by $\sigma$ [otherwise $(u, u') \in \operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)$]. Since $\sigma$ is also not the indiscrete partition,
there must be a third element $u''$ not in the same block of $\sigma$ as $u$ and $u'$. But since $u$ and $u'$ are
in different blocks of $\pi$, the third element $u''$ must be distinguished from one or the other or both
in $\pi$. Hence $(u, u'')$ or $(u', u'')$ must be distinguished by both partitions and thus must be in their
mutual information set $\operatorname{Mut}(\pi, \sigma) = \operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)$. $\square$

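The dit sets $\operatorname{dit}(\pi)$ and their complementary indit sets (= equivalence relations) $\operatorname{indit}(\pi) =
U^2 - \operatorname{dit}(\pi)$ are easily characterized as:

$$\begin{aligned}
\operatorname{indit}(\pi) &= \bigcup_{B \in \pi} B \times B \\
\operatorname{dit}(\pi) &= \bigcup_{B \neq B';\, B, B' \in \pi} B \times B' = U \times U - \operatorname{indit}(\pi) = \operatorname{indit}(\pi)^c.
\end{aligned}$$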

The mutual information set can also be characterized in this manner. 

Proposition 2 (Structure of mutual information sets) Given partitions $\pi$ and $\sigma$ with blocks
$\{B\}_{B \in \pi}$ and $\{C\}_{C \in \sigma}$, then

$$\operatorname{Mut}(\pi, \sigma) = \bigcup_{B \in \pi, C \in \sigma} (B - (B \cap C)) \times (C - (B \cap C)) = \bigcup_{B \in \pi, C \in \sigma} (B - C) \times (C - B).$$

The union (which is a disjoint union) will include the pairs $(u, u')$ where for some $B \in \pi$ and
$C \in \sigma$, $u \in B - (B \cap C)$ and $u' \in C - (B \cap C)$. Since $u'$ is in $C$ but not in the intersection $B \cap C$,
it must be in a different block of $\pi$ than $B$ so $(u, u') \in \operatorname{dit}(\pi)$. Symmetrically, $(u, u') \in \operatorname{dit}(\sigma)$ so
$(u, u') \in \operatorname{Mut}(\pi, \sigma) = \operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)$. Conversely if $(u, u') \in \operatorname{Mut}(\pi, \sigma)$ then take the $B$ containing
$u$ and the $C$ containing $u'$. Since $(u, u')$ is distinguished by both partitions, $u \notin C$ and $u' \notin B$ so
that $(u, u') \in (B - (B \cap C)) \times (C - (B \cap C))$. $\square$
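Both propositions can be checked exhaustively on a small universe. The sketch below enumerates all 15 partitions of a four-element set with a simple recursive generator (our own helper, not from the text) and tests the two statements by brute force; this is a sanity check on one small case, not a proof.

```python
from itertools import product

def set_partitions(elements):
    """Yield every partition of a list, each as a list of sets (blocks)."""
    if not elements:
        yield []
        return
    first, rest = elements[0], elements[1:]
    for smaller in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(smaller)):
            yield smaller[:i] + [smaller[i] | {first}] + smaller[i + 1:]
        # ... or into a new singleton block.
        yield smaller + [{first}]

def dit_set(partition):
    """Ordered pairs (u, v) whose elements lie in distinct blocks."""
    block_of = {u: i for i, block in enumerate(partition) for u in block}
    return {(u, v) for u in block_of for v in block_of if block_of[u] != block_of[v]}

def mut_by_structure(pi, sigma):
    """Union over B, C of (B - C) x (C - B), as in Proposition 2."""
    return {(u, v) for b in pi for c in sigma for u in b - c for v in c - b}

U = [0, 1, 2, 3]
partitions = list(set_partitions(U))
nontrivial = [p for p in partitions if len(p) > 1]  # drop the indiscrete partition

# Proposition 1: any two non-empty dit sets intersect.
prop1_holds = all(dit_set(p) & dit_set(s) for p, s in product(nontrivial, repeat=2))

# Proposition 2: the intersection of dit sets equals the structural union.
prop2_holds = all((dit_set(p) & dit_set(s)) == mut_by_structure(p, s)
                  for p, s in product(partitions, repeat=2))

print(len(partitions), prop1_holds, prop2_holds)
```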

The probability that a pair randomly chosen from $U \times U$ would be distinguished by $\pi$ and $\sigma$
would be given by the normalized counting measure of the mutual information set which is the:

$$m(\pi, \sigma) = \frac{|\operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)|}{|U|^2} = \text{probability that } \pi \text{ and } \sigma \text{ both distinguish}$$
Mutual logical information of $\pi$ and $\sigma$.

By the inclusion-exclusion principle:

$$|\operatorname{Mut}(\pi, \sigma)| = |\operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)| = |\operatorname{dit}(\pi)| + |\operatorname{dit}(\sigma)| - |\operatorname{dit}(\pi) \cup \operatorname{dit}(\sigma)|.$$

8 The contrapositive of the "non-empty dit sets intersect" proposition is also interesting. Given two equivalence
relations $E_1, E_2 \subseteq U^2$, if every pair of elements $u, u' \in U$ is equated by one or the other of the relations, i.e.,
$E_1 \cup E_2 = U^2$, then either $E_1 = U^2$ or $E_2 = U^2$.

Normalizing, the probability that a random pair is distinguished by both partitions is given by the
inclusion-exclusion principle:

$$\begin{aligned}
m(\pi, \sigma) &= \frac{|\operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)|}{|U|^2} \\
&= \frac{|\operatorname{dit}(\pi)|}{|U|^2} + \frac{|\operatorname{dit}(\sigma)|}{|U|^2} - \frac{|\operatorname{dit}(\pi) \cup \operatorname{dit}(\sigma)|}{|U|^2} \\
&= h(\pi) + h(\sigma) - h(\pi \vee \sigma).
\end{aligned}$$
Inclusion-exclusion principle for logical entropies of partitions

This can be extended after the fashion of the inclusion-exclusion principle to any number of
partitions. It was previously noted that the intersection of two dit sets is not necessarily the dit set of a
partition, but the interior of the intersection is the dit set $\operatorname{dit}(\pi \wedge \sigma)$ of the partition meet $\pi \wedge \sigma$.
Hence we also have the:

$$h(\pi \wedge \sigma) \le h(\pi) + h(\sigma) - h(\pi \vee \sigma)$$
Submodular inequality for logical entropies.



5.2 The case for joint distributions 

Consider again a joint distribution $p(x, y)$ over $X \times Y$ for finite $X$ and $Y$. Intuitively, the mutual
logical information $m(x, y)$ in the joint distribution $p(x, y)$ would be the probability that a sampled
pair $(x, y)$ would be a distinction of $p(x)$ and a distinction of $p(y)$. That means for each probability
$p(x, y)$, it must be multiplied by the probability of not drawing the same $x$ and not drawing the
same $y$ (e.g., in a second independent drawing). In the Venn diagram, the area or probability of
the drawing that $x$ or that $y$ is $p(x) + p(y) - p(x, y)$ (correcting for adding the overlap twice) so
the probability of getting neither that $x$ nor that $y$ is the complement $1 - p(x) - p(y) + p(x, y) =
[1 - p(x)] + [1 - p(y)] - [1 - p(x, y)]$.

Figure 6: $[1 - p(x)] + [1 - p(y)] - [1 - p(x, y)]$
= shaded area in Venn diagram for $X \times Y$



Hence we have: 



$$m(x, y) = \sum_{x,y} p(x, y)\left[[1 - p(x)] + [1 - p(y)] - [1 - p(x, y)]\right]$$
Logical mutual information in a joint probability distribution.

The probability of two independent draws differing in either the $x$ or the $y$ is just the logical
entropy of the joint distribution:

$$h(x, y) = h(p(x, y)) = \sum_{x,y} p(x, y)\left[1 - p(x, y)\right] = 1 - \sum_{x,y} p(x, y)^2.$$

Using a little algebra to expand the logical mutual information:

$$\begin{aligned}
m(x, y) &= \left[1 - \sum_{x,y} p(x, y)p(x)\right] + \left[1 - \sum_{x,y} p(x, y)p(y)\right] - \left[1 - \sum_{x,y} p(x, y)^2\right] \\
&= h(x) + h(y) - h(x, y)
\end{aligned}$$
Inclusion-exclusion principle for logical entropies of a joint distribution.

Figure 7: $m(x, y) = h(x) + h(y) - h(x, y)$
= shaded area in Venn diagram for $(X \times Y)^2$.

It might be noted that the logical mutual information, like the other logical entropies, is not
just an average; the mutual information is the product probability measure of the subset:

$$\{[(x, y), (x', y')] : x \neq x', y \neq y'\} \subseteq (X \times Y) \times (X \times Y).$$
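The three descriptions of the logical mutual information (the averaged formula, the product measure of the subset above, and inclusion-exclusion) can be compared on a small case. The joint distribution here is a hypothetical example, and exact fractions are used so the comparison is exact.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution p(x, y), chosen only for illustration.
p_xy = {("a", 0): Fraction(1, 2), ("a", 1): Fraction(1, 4), ("b", 1): Fraction(1, 4)}

def marginals(p_xy):
    """Return the marginal distributions p(x) and p(y)."""
    p_x, p_y = {}, {}
    for (x, y), prob in p_xy.items():
        p_x[x] = p_x.get(x, 0) + prob
        p_y[y] = p_y.get(y, 0) + prob
    return p_x, p_y

def logical_entropy(dist):
    """h = 1 - sum of squared probabilities."""
    return 1 - sum(prob ** 2 for prob in dist.values())

def m_formula(p_xy):
    """sum p(x,y) [[1 - p(x)] + [1 - p(y)] - [1 - p(x,y)]]."""
    p_x, p_y = marginals(p_xy)
    return sum(prob * ((1 - p_x[x]) + (1 - p_y[y]) - (1 - prob))
               for (x, y), prob in p_xy.items())

def m_product_measure(p_xy):
    """Product measure of the pairs of draws with x != x' and y != y'."""
    return sum(p1 * p2
               for ((x1, y1), p1), ((x2, y2), p2) in product(p_xy.items(), repeat=2)
               if x1 != x2 and y1 != y2)

p_x, p_y = marginals(p_xy)
by_inclusion_exclusion = logical_entropy(p_x) + logical_entropy(p_y) - logical_entropy(p_xy)
print(m_formula(p_xy), m_product_measure(p_xy), by_inclusion_exclusion)
```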

6 Mutual information for Shannon entropies 
6.1 The case for partitions 

The usual heuristic motivation for Shannon's mutual information is much like its development 
from the logical mutual information so we will take that approach at the outset. The logical mutual 
information for partitions can be expressed in the form: 

$$m(\pi, \sigma) = \sum_{B,C} p_{B \cap C}\left[(1 - p_B) + (1 - p_C) - (1 - p_{B \cap C})\right]$$

so if we substitute the bit-counts for the dit-counts as before, we get:

$$I(\pi, \sigma) = \sum_{B,C} p_{B \cap C}\left[\log\left(\frac{1}{p_B}\right) + \log\left(\frac{1}{p_C}\right) - \log\left(\frac{1}{p_{B \cap C}}\right)\right] = \sum_{B,C} p_{B \cap C}\log\left(\frac{p_{B \cap C}}{p_B p_C}\right)$$
Shannon's mutual information for partitions.

Keeping the logs separate gives the Venn diagram picture:

$$\begin{aligned}
I(\pi, \sigma) &= \sum_{B,C} p_{B \cap C}\left[\log\left(\frac{1}{p_B}\right) + \log\left(\frac{1}{p_C}\right) - \log\left(\frac{1}{p_{B \cap C}}\right)\right] \\
&= H(\pi) + H(\sigma) - H(\pi \vee \sigma)
\end{aligned}$$
Inclusion-exclusion analogy for Shannon entropies of partitions.

6.2 The case for joint distributions 

To move from partitions to probability distributions, consider again the joint distribution $p(x, y)$
on $X \times Y$. Then developing the Shannon mutual information from the logical mutual information
amounts to replacing the block probabilities $p_{B \cap C}$ in the join $\pi \vee \sigma$ by the joint probabilities $p(x, y)$
and the probabilities in the separate partitions by the marginals (since $p_B = \sum_{C \in \sigma} p_{B \cap C}$ and
$p_C = \sum_{B \in \pi} p_{B \cap C}$), to obtain:

$$I(x, y) = \sum_{x,y} p(x, y)\log\left(\frac{p(x, y)}{p(x)p(y)}\right)$$
Shannon mutual information in a joint probability distribution.

Then the same proof carries over to give the:

$$I(x, y) = H(x) + H(y) - H(x, y)$$

Figure 8: Inclusion-exclusion "picture" for Shannon entropies of probability distributions.

The logical mutual information formula:

$$m(x, y) = \sum_{x,y} p(x, y)\left[[1 - p(x)] + [1 - p(y)] - [1 - p(x, y)]\right]$$

develops via the dit-count to bit-count conversion to:

$$\sum_{x,y} p(x, y)\left[\log\left(\frac{1}{p(x)}\right) + \log\left(\frac{1}{p(y)}\right) - \log\left(\frac{1}{p(x, y)}\right)\right] = \sum_{x,y} p(x, y)\log\left(\frac{p(x, y)}{p(x)p(y)}\right) = I(x, y).$$
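The dit-count to bit-count conversion for mutual information can again be sketched with one template and two information measures. The names and the joint distribution are illustrative assumptions; logarithms are base 2.

```python
import math

# Hypothetical joint distribution p(x, y), chosen only for illustration.
p_xy = {("a", 0): 0.5, ("a", 1): 0.25, ("b", 1): 0.25}

def marginals(p_xy):
    """Return the marginal distributions p(x) and p(y)."""
    p_x, p_y = {}, {}
    for (x, y), prob in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + prob
        p_y[y] = p_y.get(y, 0.0) + prob
    return p_x, p_y

def entropy(dist, info):
    """sum p * info(p): logical or Shannon entropy depending on `info`."""
    return sum(prob * info(prob) for prob in dist.values() if prob > 0)

def mutual_information(p_xy, info):
    """sum p(x,y) [info(p(x)) + info(p(y)) - info(p(x,y))]."""
    p_x, p_y = marginals(p_xy)
    return sum(prob * (info(p_x[x]) + info(p_y[y]) - info(prob))
               for (x, y), prob in p_xy.items())

def dit_count(prob):
    return 1.0 - prob

def bit_count(prob):
    return math.log2(1.0 / prob)

p_x, p_y = marginals(p_xy)
m = mutual_information(p_xy, dit_count)        # logical m(x, y)
big_i = mutual_information(p_xy, bit_count)    # Shannon I(x, y)
print(m, big_i)
```

With either measure, the result equals the corresponding "entropy of $x$ plus entropy of $y$ minus joint entropy" expression.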

Thus the genuine Venn diagram relationships for the product probability measure that gives 
the logical entropies carry over, via the dit-count to bit-count conversion, to give a similar Venn 
diagram picture for the Shannon entropies.

7 Independence



7.1 Independent Partitions 

Two partitions $\pi$ and $\sigma$ are said to be (stochastically) independent if for all $B \in \pi$ and $C \in \sigma$,
$p_{B \cap C} = p_B p_C$. If $\pi$ and $\sigma$ are independent, then:

$$I(\pi, \sigma) = \sum_{B \in \pi, C \in \sigma} p_{B \cap C}\log\left(\frac{p_{B \cap C}}{p_B p_C}\right) = 0 = H(\pi) + H(\sigma) - H(\pi \vee \sigma),$$

so that:

$$H(\pi \vee \sigma) = H(\pi) + H(\sigma)$$
Shannon entropy for partitions additive under independence.

In ordinary probability theory, two events $E, E' \subseteq U$ for a sample space $U$ are said to be
independent if $\Pr(E \cap E') = \Pr(E)\Pr(E')$. We have used the motivation of thinking of a
partition-as-dit-set $\operatorname{dit}(\pi)$ as an "event" in a sample space $U \times U$ with the probability of that event being
$h(\pi)$, the logical entropy of the partition. The following proposition shows that this motivation
extends to the notion of independence.

Proposition 3 (Independent partitions have independent dit sets) If $\pi$ and $\sigma$ are
(stochastically) independent partitions, then their dit sets $\operatorname{dit}(\pi)$ and $\operatorname{dit}(\sigma)$ are independent as events in
the sample space $U \times U$ (with equiprobable points).

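For independent partitions $\pi$ and $\sigma$, we need to show that the probability $m(\pi, \sigma)$ of the event
$\operatorname{Mut}(\pi, \sigma) = \operatorname{dit}(\pi) \cap \operatorname{dit}(\sigma)$ is equal to the product of the probabilities $h(\pi)$ and $h(\sigma)$ of the events
$\operatorname{dit}(\pi)$ and $\operatorname{dit}(\sigma)$ in the sample space $U \times U$. By the assumption of stochastic independence, we have
$\frac{|B \cap C|}{|U|} = p_{B \cap C} = p_B p_C = \frac{|B||C|}{|U|^2}$ so that $|B \cap C| = |B||C|/|U|$. By the previous structure theorem
for the mutual information set: $\operatorname{Mut}(\pi, \sigma) = \bigcup_{B \in \pi, C \in \sigma} (B - (B \cap C)) \times (C - (B \cap C))$, where the
union is disjoint so that:

$$\begin{aligned}
|\operatorname{Mut}(\pi, \sigma)| &= \sum_{B \in \pi, C \in \sigma} (|B| - |B \cap C|)(|C| - |B \cap C|) \\
&= \sum_{B \in \pi, C \in \sigma} \left(|B| - \frac{|B||C|}{|U|}\right)\left(|C| - \frac{|B||C|}{|U|}\right) \\
&= \frac{1}{|U|^2}\sum_{B \in \pi, C \in \sigma} |B|(|U| - |C|)\,|C|(|U| - |B|) \\
&= \frac{1}{|U|^2}\sum_{B \in \pi} |B||U - B| \sum_{C \in \sigma} |C||U - C| \\
&= \frac{1}{|U|^2}|\operatorname{dit}(\pi)||\operatorname{dit}(\sigma)|
\end{aligned}$$

so that:

$$m(\pi, \sigma) = \frac{|\operatorname{Mut}(\pi, \sigma)|}{|U|^2} = \frac{|\operatorname{dit}(\pi)|}{|U|^2}\frac{|\operatorname{dit}(\sigma)|}{|U|^2} = h(\pi)h(\sigma). \quad \square$$

Hence the logical entropies behave like probabilities under independence; the probability that $\pi$
and $\sigma$ both distinguish, i.e., $m(\pi, \sigma)$, is equal to the probability $h(\pi)$ that $\pi$ distinguishes times the
probability $h(\sigma)$ that $\sigma$ distinguishes:

$$m(\pi, \sigma) = h(\pi)h(\sigma)$$
Logical entropy multiplicative under independence.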

It is sometimes convenient to think in the complementary terms of an equivalence relation
"equating" or "identifying" rather than a partition distinguishing. Since $h(\pi)$ can be interpreted as
the probability that a random pair of elements from $U$ are distinguished by $\pi$, i.e., as a distinction
probability, its complement $1 - h(\pi)$ can be interpreted as an identification probability, i.e., the
probability that a random pair is equated by $\pi$ (thinking of $\pi$ as an equivalence relation on $U$). In
general,

$$[1 - h(\pi)][1 - h(\sigma)] = 1 - h(\pi) - h(\sigma) + h(\pi)h(\sigma) = [1 - h(\pi \vee \sigma)] + [h(\pi)h(\sigma) - m(\pi, \sigma)]$$

which could also be rewritten as:

$$[1 - h(\pi \vee \sigma)] - [1 - h(\pi)][1 - h(\sigma)] = m(\pi, \sigma) - h(\pi)h(\sigma).$$

Thus if $\pi$ and $\sigma$ are independent, then the probability that the join partition $\pi \vee \sigma$ identifies is the
probability that $\pi$ identifies times the probability that $\sigma$ identifies:

$$[1 - h(\pi)][1 - h(\sigma)] = [1 - h(\pi \vee \sigma)]$$
Multiplicative identification probabilities under independence.
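The behaviour of both entropies under independence can be seen on a hypothetical example: six equiprobable points arranged as a $2 \times 3$ grid, with $\pi$ the partition into rows and $\sigma$ the partition into columns. The helper names below are ours; logical quantities are exact fractions and Shannon quantities are base-2 floats.

```python
import math
from fractions import Fraction

def dit_set(partition):
    """Ordered pairs (u, v) whose elements lie in distinct blocks."""
    block_of = {u: i for i, block in enumerate(partition) for u in block}
    return {(u, v) for u in block_of for v in block_of if block_of[u] != block_of[v]}

def join(pi, sigma):
    """Blocks of the join: the non-empty intersections."""
    return [b & c for b in pi for c in sigma if b & c]

def logical_entropy(partition, n):
    return Fraction(len(dit_set(partition)), n * n)

def shannon_entropy(partition, n):
    return sum((len(b) / n) * math.log2(n / len(b)) for b in partition)

n = 6
pi = [{0, 1, 2}, {3, 4, 5}]        # rows of the grid
sigma = [{0, 3}, {1, 4}, {2, 5}]   # columns of the grid

# Stochastic independence: p_{B&C} = p_B * p_C for every pair of blocks.
is_independent = all(
    Fraction(len(b & c), n) == Fraction(len(b), n) * Fraction(len(c), n)
    for b in pi for c in sigma
)

m = Fraction(len(dit_set(pi) & dit_set(sigma)), n * n)
h_pi, h_sigma = logical_entropy(pi, n), logical_entropy(sigma, n)
h_join = logical_entropy(join(pi, sigma), n)
print(is_independent, m, h_pi * h_sigma, 1 - h_join)
```

Here the logical mutual information factors as a product, the identification probabilities multiply, and the Shannon entropies add, as the formulas above state.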

7.2 Independent Joint Distributions 

A joint probability distribution p (x, y) on X x Y is independent if each value is the product of the 
marginals: p(x,y) =p(x)p(y). 

For an independent distribution, the Shannon mutual information 

$$I(x, y) = \sum_{x \in X, y \in Y} p(x, y)\log\left(\frac{p(x, y)}{p(x)p(y)}\right)$$

is immediately seen to be zero so we have:

$$H(x, y) = H(x) + H(y)$$
Shannon entropies for independent $p(x, y)$.

For the logical mutual information, independence gives:

$$\begin{aligned}
m(x, y) &= \sum_{x,y} p(x, y)\left[1 - p(x) - p(y) + p(x, y)\right] \\
&= \sum_{x,y} p(x)p(y)\left[1 - p(x) - p(y) + p(x)p(y)\right] \\
&= \sum_{x} p(x)\left[1 - p(x)\right]\sum_{y} p(y)\left[1 - p(y)\right] \\
&= h(x)h(y)
\end{aligned}$$
Logical entropies for independent $p(x, y)$.

This independence condition $m(x, y) = h(x)h(y)$ plus the inclusion-exclusion principle $m(x, y) =
h(x) + h(y) - h(x, y)$ also implies that:

$$\begin{aligned}
[1 - h(x)][1 - h(y)] &= 1 - h(x) - h(y) + h(x)h(y) \\
&= 1 - h(x) - h(y) + m(x, y) \\
&= 1 - h(x, y).
\end{aligned}$$

Hence under independence, the probability of drawing the same pair $(x, y)$ in two independent
draws is equal to the probability of drawing the same $x$ times the probability of drawing the same
$y$.

8 Cross-entropies and divergences 

Given two probability distributions $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$ on the same sample space
$\{1, \ldots, n\}$, we can again consider the drawing of a pair of points but where the first drawing is
according to $p$ and the second drawing according to $q$. The probability that the points are distinct
would be a natural and more general notion of logical entropy that would be the:

$$h(p\|q) = \sum_i p_i(1 - q_i) = 1 - \sum_i p_i q_i$$
Logical cross entropy of $p$ and $q$

which is symmetric. The logical cross entropy is the same as the logical entropy when the
distributions are the same, i.e., if $p = q$, then $h(p\|q) = h(p)$.

The notion of cross entropy in Shannon entropy can be developed by applying the dit-bit connection
to the logical cross entropy $\sum_i p_i(1 - q_i)$ to obtain:

$$H(p\|q) = \sum_i p_i\log\left(\frac{1}{q_i}\right)$$

which is not symmetrical due to the asymmetric role of the logarithm, although if $p = q$, then
$H(p\|q) = H(p)$. Since the logical cross entropy is symmetrical, it could also be expressed as
$\sum_i q_i(1 - p_i)$ which develops to the Shannon cross entropy $H(q\|p) = \sum_i q_i\log\left(\frac{1}{p_i}\right)$ so it might be
more reasonable to use a symmetrized cross entropy:

$$H_s(p\|q) = \frac{1}{2}\left[H(p\|q) + H(q\|p)\right].$$

The Kullback-Leibler divergence (or relative entropy) $D(p\|q) = \sum_i p_i\log\left(\frac{p_i}{q_i}\right)$ is defined as a
measure of the distance or divergence between the two distributions where $D(p\|q) = H(p\|q) - H(p)$.
A basic result is the:

$$D(p\|q) \ge 0 \text{ with equality if and only if } p = q$$
Information inequality [6, p. 26].

Given two partitions $\pi$ and $\sigma$, the inequality $I(\pi, \sigma) \ge 0$ is obtained by applying the information
inequality to the two distributions $\{p_{B \cap C}\}$ and $\{p_B p_C\}$ on the sample space $\{(B, C) : B \in \pi, C \in \sigma\} =
\pi \times \sigma$:

$$I(\pi, \sigma) = \sum_{B,C} p_{B \cap C}\log\left(\frac{p_{B \cap C}}{p_B p_C}\right) = D\left(\{p_{B \cap C}\}\,\|\,\{p_B p_C\}\right) \ge 0$$
with equality iff independence.

In the same manner, we have for the joint distribution $p(x, y)$:

$$I(x, y) = D\left(p(x, y)\,\|\,p(x)p(y)\right) \ge 0$$
with equality iff independence.

The symmetrized Kullback-Leibler divergence is:

$$D_s(p\|q) = \frac{1}{2}\left[D(p\|q) + D(q\|p)\right] = H_s(p\|q) - \frac{H(p) + H(q)}{2}.$$
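The Shannon-side quantities of this section can be computed directly. The sketch below assumes base-2 logarithms and that $q_i > 0$ wherever $p_i > 0$; the two distributions are hypothetical examples.

```python
import math

def shannon_entropy(p):
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p||q) = sum_i p_i log2(1/q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D(p||q) = sum_i p_i log2(p_i/q_i); same assumption on q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetrized_cross_entropy(p, q):
    return 0.5 * (cross_entropy(p, q) + cross_entropy(q, p))

def symmetrized_kl(p, q):
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

p = [0.5, 0.25, 0.25]   # hypothetical distributions on {1, 2, 3}
q = [0.25, 0.25, 0.5]

print(cross_entropy(p, q), kl_divergence(p, q), symmetrized_kl(p, q))
```

For these inputs one can confirm $D(p\|q) = H(p\|q) - H(p)$, the non-negativity of $D$, and the stated expression for $D_s$ in terms of $H_s$.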

But starting afresh, one might ask: "What is the natural measure of the difference or distance 
between two probability distributions p = (pi,...,p n ) and q = (qi,---,q n ) that would always be 
non-negative, and would be zero if and only if they are equal?" The (Euclidean) distance between 
the two points in M. n would seem to be the "logical" answer — so we take that distance (squared 
with a scale factor) as the definition of the: 

d(p\\q) = lEife -iif n 

Logical divergence (or logical relative entropy]^] 

which is symmetric and we trivially have: 

d (p\ \q) > with equality iff p = q 
Logical information inequality. 

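We have component-wise:

$$0 \leq (p_i - q_i)^2 = p_i^2 - 2 p_i q_i + q_i^2 = 2\left[\tfrac{1}{n} - p_i q_i\right] - \left[\tfrac{1}{n} - p_i^2\right] - \left[\tfrac{1}{n} - q_i^2\right]$$
so that taking the sum for $i = 1, \ldots, n$ gives:

$$\begin{aligned}
d(p\|q) &= \tfrac{1}{2} \sum_i (p_i - q_i)^2 \\
&= \left[1 - \sum_i p_i q_i\right] - \tfrac{1}{2}\left[\left(1 - \sum_i p_i^2\right) + \left(1 - \sum_i q_i^2\right)\right] \\
&= h(p\|q) - \frac{h(p) + h(q)}{2}.
\end{aligned}$$
Logical divergence = Jensen difference [23, p. 25] between probability distributions.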
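
The identity between the logical divergence and the Jensen difference can be checked numerically with a short Python sketch. The function names and the example distributions are assumptions chosen for illustration; the formulas are the ones given above.

```python
def logical_entropy(p):
    # h(p) = sum_i p_i (1 - p_i)
    return sum(pi * (1 - pi) for pi in p)

def logical_cross_entropy(p, q):
    # h(p||q) = sum_i p_i (1 - q_i)
    return sum(pi * (1 - qi) for pi, qi in zip(p, q))

def logical_divergence(p, q):
    # d(p||q) = (1/2) sum_i (p_i - q_i)^2
    return 0.5 * sum((pi - qi) ** 2 for pi, qi in zip(p, q))

# Hypothetical example distributions
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

# Jensen difference: cross entropy minus the average of the two entropies
jensen_difference = logical_cross_entropy(p, q) - 0.5 * (logical_entropy(p) + logical_entropy(q))

print(logical_divergence(p, q), jensen_difference)  # the two values agree
```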
9 In [7], this definition was given without the useful scale factor of 1/2. 






Then the logical information inequality implies that the logical cross-entropy is greater than or equal to the average of the logical entropies:

$$h(p\|q) \geq \frac{h(p) + h(q)}{2} \text{ with equality iff } p = q.$$

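The half-and-half probability distribution $\frac{p+q}{2}$ that mixes $p$ and $q$ has the logical entropy of

$$h\left(\frac{p+q}{2}\right) = \frac{h(p\|q)}{2} + \frac{h(p) + h(q)}{4} = \frac{1}{2}\left[h(p\|q) + \frac{h(p) + h(q)}{2}\right]$$

so that:

$$h(p\|q) \geq h\left(\frac{p+q}{2}\right) \geq \frac{h(p) + h(q)}{2} \text{ with equality iff } p = q.$$
Mixing different $p$ and $q$ increases logical entropy.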
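
The chain of inequalities for the half-and-half mixture can be illustrated with a minimal Python sketch. The helper names and the two distributions are hypothetical choices, not part of the original text.

```python
def logical_entropy(p):
    # h(p) = sum_i p_i (1 - p_i)
    return sum(pi * (1 - pi) for pi in p)

def logical_cross_entropy(p, q):
    # h(p||q) = sum_i p_i (1 - q_i)
    return sum(pi * (1 - qi) for pi, qi in zip(p, q))

def mixture(p, q):
    # Half-and-half mixture (p + q) / 2
    return [(pi + qi) / 2 for pi, qi in zip(p, q)]

# Hypothetical example distributions
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

h_cross = logical_cross_entropy(p, q)
h_mix = logical_entropy(mixture(p, q))
h_avg = (logical_entropy(p) + logical_entropy(q)) / 2

# Expected ordering: h(p||q) >= h((p+q)/2) >= [h(p) + h(q)] / 2
print(h_cross, h_mix, h_avg)
```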

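The logical divergence can be expressed as:

$$d(p\|q) = \tfrac{1}{2}\left[\sum_i p_i (1 - q_i) + \sum_i q_i (1 - p_i)\right] - \tfrac{1}{2}\left[\left(\sum_i p_i (1 - p_i)\right) + \left(\sum_i q_i (1 - q_i)\right)\right]$$
that develops via the dit-bit connection to:

$$\begin{aligned}
&\tfrac{1}{2}\left[\sum_i p_i \log\left(\frac{1}{q_i}\right) + \sum_i q_i \log\left(\frac{1}{p_i}\right) - \sum_i p_i \log\left(\frac{1}{p_i}\right) - \sum_i q_i \log\left(\frac{1}{q_i}\right)\right] \\
&\quad = \tfrac{1}{2}\left[\sum_i p_i \log\left(\frac{p_i}{q_i}\right) + \sum_i q_i \log\left(\frac{q_i}{p_i}\right)\right] \\
&\quad = \tfrac{1}{2}\left[D(p\|q) + D(q\|p)\right] \\
&\quad = D_s(p\|q).
\end{aligned}$$

Thus the logical divergence $d(p\|q)$ develops via the dit-bit connection to the symmetrized version of the Kullback-Leibler divergence.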
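
One way to see the parallel is to write a single routine parameterized by the per-outcome measure, so that substituting $\log(1/p_i)$ for $1 - p_i$ turns the logical divergence into the symmetrized Kullback-Leibler divergence. This is only a sketch: the function `divergence` and the names `dit` and `bit` are hypothetical, base-2 logarithms are assumed, and all probabilities are assumed strictly positive.

```python
import math

def divergence(p, q, info):
    """Symmetrized cross entropy minus the average entropy, for a per-outcome measure info(prob)."""
    cross = 0.5 * sum(pi * info(qi) + qi * info(pi) for pi, qi in zip(p, q))
    average = 0.5 * sum(pi * info(pi) + qi * info(qi) for pi, qi in zip(p, q))
    return cross - average

def dit(prob):
    # Logical (dit) term: 1 - p_i
    return 1 - prob

def bit(prob):
    # Shannon (bit) term: log(1/p_i); assumes prob > 0
    return math.log2(1 / prob)

# Hypothetical example distributions with strictly positive entries
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

logical_d = divergence(p, q, dit)      # equals (1/2) sum_i (p_i - q_i)^2
symmetrized_kl = divergence(p, q, bit) # equals (1/2) [D(p||q) + D(q||p)]
print(logical_d, symmetrized_kl)
```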



9 Summary and concluding remarks 

The following table summarizes the concepts for the Shannon and logical entropies. We use the case of probability distributions rather than partitions, and we use the abbreviations $p_{xy} = p(x, y)$, $p_x = p(x)$, and $p_y = p(y)$.





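| | Shannon Entropy | Logical Entropy |
|---|---|---|
| Entropy | $H(p) = \sum_i p_i \log(1/p_i)$ | $h(p) = \sum_i p_i (1 - p_i)$ |
| Mutual Info. | $I(x, y) = H(x) + H(y) - H(x, y)$ | $m(x, y) = h(x) + h(y) - h(x, y)$ |
| Independence | $I(x, y) = 0$ | $m(x, y) = h(x)\, h(y)$ |
| Indep. Relations | $H(x, y) = H(x) + H(y)$ | $1 - h(x, y) = [1 - h(x)]\,[1 - h(y)]$ |
| Cond. entropy | $H(x \mid y) = \sum_{x,y} p_{xy} \log(p_y / p_{xy})$ | $h(x \mid y) = \sum_{x,y} p_{xy} (p_y - p_{xy})$ |
| Relationships | $H(x \mid y) = H(x, y) - H(y)$ | $h(x \mid y) = h(x, y) - h(y)$ |
| Cross entropy | $H(p \Vert q) = \sum_i p_i \log(1/q_i)$ | $h(p \Vert q) = \sum_i p_i (1 - q_i)$ |
| Divergence | $D(p \Vert q) = \sum_i p_i \log(p_i / q_i)$ | $d(p \Vert q) = \tfrac{1}{2} \sum_i (p_i - q_i)^2$ |
| Relationships | $D(p \Vert q) = H(p \Vert q) - H(p)$ | $d(p \Vert q) = h(p \Vert q) - [h(p) + h(q)]/2$ |
| Info. Inequality | $D(p \Vert q) \geq 0$ with $=$ iff $p = q$ | $d(p \Vert q) \geq 0$ with $=$ iff $p = q$ |

Table of comparisons between Shannon and logical entropies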
The above table shows many of the same relationships holding between the various forms of
the logical and Shannon entropies due ultimately to the dit-bit connection. The dit-bit connection 
between the two notions of entropy is based on them being two different measures of the "amount 
of information-as-distinctions," the dit-count being the normalized count of the distinctions and 
the bit-count being the number of binary partitions required (on average) to make the distinctions. 
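
The independence rows of the comparison can be checked on a small product distribution. The sketch below is illustrative only: the marginals are hypothetical, the helper names `H` and `h` are chosen to mirror the notation, and base-2 logarithms are assumed.

```python
import math

def H(dist):
    # Shannon entropy: sum_i p_i log(1/p_i), with 0 log 0 = 0
    return sum(p * math.log2(1 / p) for p in dist if p > 0)

def h(dist):
    # Logical entropy: sum_i p_i (1 - p_i) = 1 - sum_i p_i^2
    return 1 - sum(p * p for p in dist)

# Hypothetical marginals; the joint is their product, so x and y are independent
px = [0.5, 0.5]
py = [0.25, 0.75]
pxy = [a * b for a in px for b in py]

shannon_mutual = H(px) + H(py) - H(pxy)   # I(x, y), zero under independence
logical_mutual = h(px) + h(py) - h(pxy)   # m(x, y), equals h(x) h(y) under independence

print(shannon_mutual)
print(logical_mutual, h(px) * h(py))
print(1 - h(pxy), (1 - h(px)) * (1 - h(py)))
```

Under independence the Shannon entropies add while the complements of the logical entropies multiply, which is the contrast the two "Indep. Relations" entries express.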

Logical entropies arise naturally as the normalized counting measure for partition logic just as 
probabilities arise as the normalized counting measure for subset logic, where the two logics are
dual to one another. All the forms of logical entropy have simple interpretations as the probabilities 
of distinctions. Shannon entropy is a higher-level and more refined notion adapted to the theory of 
communications and coding where it can be interpreted as the average number of bits necessary 
per letter to code the messages, i.e., the average number of binary partitions necessary per letter 
to distinguish the messages. 